1  Objectives

Given data samples, inferring the rules underlying the data is a major challenge
in  the  area  of  machine  learning.  Such  a  learning  method  can  be  applied  to  solving  a  wide 
range  of  real-world  problems  such  as  robot  control,  bioinformatics,  brain  signal  analysis, 
computer  vision,  speech  recognition,  and  natural  language  processing.  For  this  reason,  a 
great  deal  of  effort  has  been  made  recently  to  develop  various  machine  learning  algorithms. 
Our  goal  is  to  propose  a  general  machine  learning  framework  that  can  be  employed  for 
improving  the  state-of-the-art  performance  in  these  application  domains. 

More specifically, we establish a novel approach that accommodates various challenging
machine learning tasks such as non-stationarity adaptation, outlier detection,
dimensionality reduction, and conditional probability estimation. Our key idea is to directly
estimate the ratio of two probability densities without going through density estimation,
which is a novel paradigm in the machine learning community. Under the common concept
of direct density-ratio estimation, we develop tailored machine learning algorithms
for each task and show their usefulness in various application domains.


2  Status  of  effort 

My project consists of four layers: (A) theory of density ratio estimation, (B) algorithms
of density ratio estimation, (C) machine learning algorithms based on density ratio
estimation, and (D) real-world applications of density ratio estimation.

For layer (B), we developed a new method of direct density-ratio estimation that is
significantly more accurate than naively estimating the ratio via density estimation. We
also developed density-ratio estimation algorithms that can handle correlated and
rank-deficient data. We further proposed combining density ratio estimation with
dimensionality reduction, which improves the estimation accuracy in high-dimensional problems.

For layer (A), we elucidated statistical properties of density ratio estimators for
parametric and non-parametric cases. Such results are expected to contribute to further
improving the estimation accuracy and computational efficiency of density ratio estimators.

For layer (C), we employed our density-ratio estimation methods to design practical
machine learning algorithms for tasks including non-stationarity adaptation, outlier
detection, supervised dimensionality reduction, causal direction inference, independent
component analysis, conditional density estimation, probabilistic classification, and
multi-task classification.

Finally, for layer (D), we demonstrated the usefulness of the above machine learning
algorithms in several real-world applications such as brain-computer interfaces, robot
control, speech and audio recognition, image processing, and sensor data analysis.
3  Abstract

The basis of our project is a method to accurately estimate the ratio of probability
densities. First, we developed a new method of density-ratio estimation that avoids density
estimation (publication 1). This method gives the solution analytically just by solving a
system of linear equations, so it can be applied to large-scale problems. Furthermore, we
developed methods of direct density-ratio estimation suitable for highly correlated data
(publication 2) and rank-deficient data (publication 3). We also proposed two methods for
handling high-dimensional data: the first method is heuristic but computationally
efficient (publication 4), and the other method is theoretically justifiable but computationally
expensive (publication 5). We have also theoretically investigated statistical properties of
density ratio estimators for parametric models (publication 6) and non-parametric models
(publication 7).

Then we designed practical machine learning algorithms based on density ratio
estimators. These include outlier detection (publication 8), supervised dimensionality reduction
(publication 9), causal direction inference (publication 10), independent component
analysis (publication 11), conditional density estimation (publication 12), probabilistic
classification (publication 13), and its multi-task version (publication 14). Through extensive
experiments, these methods were shown to compare favorably with existing approaches
in terms of accuracy and/or computational efficiency.

We also explored various real-world applications using the density ratio approaches.
These include speaker identification (publication 15), audio tagging (publication 16),
non-stationarity adaptation in brain-computer interfaces (publication 17), efficient sample reuse
in robot control (publications 18 and 19), active exploration in robot control (publication
20), feature selection in robot control (publication 21), adaptation to lighting-condition
change in face-based age recognition (publication 22), detection of regions of interest in
images (publication 23), and multi-user adaptation in accelerometer-based human activity
recognition (publication 24).

Finally, the framework of density ratio estimation was published as review articles
(publications 25 and 26). We will also publish a book with MIT Press on our
density-ratio-based non-stationarity adaptation approach (publication 27; approx. 300 pages; the
final version has already been sent to the publisher). Furthermore, the entire framework of
density ratio estimation will be published as a book by Cambridge University Press
(publication 28; more than 400 pages; 90% of the material has already been prepared).

4  Personnel  Supported 

The  research  activity  of  the  following  people  was  supported. 

•  Masashi  Sugiyama  (Tokyo  Institute  of  Technology), 

•  Hirotaka  Hachiya  (Tokyo  Institute  of  Technology), 

•  Makoto  Yamada  (Tokyo  Institute  of  Technology), 




•  Gang  Niu  (Tokyo  Institute  of  Technology), 

•  Takayuki  Akiyama  (Tokyo  Institute  of  Technology), 

•  Pui-Ling  Chui  (Tokyo  Institute  of  Technology). 


5  Publications 

During the 24 months, the following papers were published. The papers indicated by
* were attached to this report, and all the publications are available from
http://sugiyama-www.cs.titech.ac.jp/~sugi/publications.html.

1. * Kanamori, T., Hido, S., & Sugiyama, M. A least-squares approach to direct
importance estimation. Journal of Machine Learning Research, vol.10 (Jul.),
pp.1391-1445, 2009.

2. Yamada, M. & Sugiyama, M. Direct importance estimation with Gaussian mixture
models. IEICE Transactions on Information and Systems, vol.E92-D, no.10,
pp.2159-2162, 2009.

3. Yamada, M., Sugiyama, M., Wichern, G., & Simm, J. Direct importance estimation
with a mixture of probabilistic principal component analyzers. IEICE Transactions
on Information and Systems, vol.E93-D, no.10, pp.2846-2849, 2010.

4. Sugiyama, M., Kawanabe, M., & Chui, P. L. Dimensionality reduction for density
ratio estimation in high-dimensional spaces. Neural Networks, vol.23, no.1,
pp.44-59, 2010.

5. * Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T., & Kawanabe,
M. Direct density-ratio estimation with dimensionality reduction via least-squares
hetero-distributional subspace search. Neural Networks, vol.24, no.2,
pp.183-198, 2011.

6. * Kanamori, T., Suzuki, T., & Sugiyama, M. Theoretical analysis of density ratio
estimation. IEICE Transactions on Fundamentals of Electronics, Communications
and Computer Sciences, vol.E93-A, no.4, pp.787-798, 2010.

7. Suzuki, T., Sugiyama, M., & Tanaka, T. Mutual information approximation via
maximum likelihood estimation of density ratio. In Proceedings of 2009 IEEE
International Symposium on Information Theory (ISIT2009), pp.463-467, Seoul, Korea,
Jun. 28-Jul. 3, 2009.

8. Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. Statistical
outlier detection using direct density ratio estimation. Knowledge and Information
Systems, vol.26, no.2, pp.309-336, 2011.

9. Suzuki, T. & Sugiyama, M. Sufficient dimension reduction via squared-loss mutual
information estimation. In Proceedings of Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS2010), JMLR Workshop and Conference
Proceedings, vol.9, pp.804-811, Sardinia, Italy, May 13-15, 2010.

10. Yamada, M. & Sugiyama, M. Dependence minimizing regression with model selection
for non-linear causal inference under non-Gaussian noise. In Proceedings of the
Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI2010), pp.643-648,
Atlanta, Georgia, USA, Jul. 11-15, 2010.

11. * Suzuki, T. & Sugiyama, M. Least-squares independent component analysis. Neural
Computation, vol.23, no.1, pp.284-301, 2011.

12. Sugiyama, M., Takeuchi, I., Kanamori, T., Suzuki, T., Hachiya, H., & Okanohara,
D. Least-squares conditional density estimation. IEICE Transactions on Information
and Systems, vol.E93-D, no.3, pp.583-594, 2010.

13. * Sugiyama, M. Superfast-trainable multi-class probabilistic classifier by least-squares
posterior fitting. IEICE Transactions on Information and Systems, vol.E93-D, no.10,
pp.2690-2701, 2010.

14. Simm, J., Sugiyama, M., & Kato, T. Computationally efficient multi-task learning
with least-squares probabilistic classifiers. IPSJ Transactions on Computer Vision
and Applications, vol.3, pp.1-8, 2011.

15. Yamada, M., Sugiyama, M., & Matsui, T. Semi-supervised speaker identification
under covariate shift. Signal Processing, vol.90, no.8, pp.2353-2361, 2010.

16. Wichern, G., Yamada, M., Thornburg, H., Sugiyama, M., & Spanias, A. Automatic
audio tagging using covariate shift adaptation. To appear in Proceedings of IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP2010),
Dallas, Texas, USA, Mar. 14-19, 2010.

17. * Li, Y., Kambara, H., Koike, Y., & Sugiyama, M. Application of covariate shift
adaptation techniques in brain computer interfaces. IEEE Transactions on Biomedical
Engineering, vol.57, no.6, pp.1318-1324, 2010.

18. Hachiya, H., Akiyama, T., Sugiyama, M., & Peters, J. Adaptive importance sampling
for value function approximation in off-policy reinforcement learning. Neural
Networks, vol.22, no.10, pp.1399-1410, 2009.

19. Hachiya, H., Peters, J., & Sugiyama, M. Efficient sample reuse in EM-based policy
search. In Proceedings of European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases (ECML-PKDD2009), pp.469-484,
Bled, Slovenia, Sep. 7-11, 2009.

20. * Akiyama, T., Hachiya, H., & Sugiyama, M. Efficient exploration through active
learning for value function approximation in reinforcement learning. Neural
Networks, vol.23, no.5, pp.639-648, 2010.

21. Hachiya, H. & Sugiyama, M. Feature selection for reinforcement learning: Evaluating
implicit state-reward dependency via conditional mutual information. In Proceedings
of European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML-PKDD2010), pp.474-489, Barcelona,
Spain, Sep. 20-24, 2010.

22. * Ueki, K., Sugiyama, M., & Ihara, Y. Lighting condition adaptation for perceived
age estimation. IEICE Transactions on Information and Systems, to appear.

23. Yamanaka, M., Matsugu, M., & Sugiyama, M. Automatic detection of regions of
interest using multiple visual saliency measures based on density ratio estimation. In
Proceedings of Vision Engineering Workshop 2010 (ViEW2010), pp.7-8, Yokohama,
Japan, Dec. 9-10, 2010.

24. Hachiya, H., Sugiyama, M., & Ueda, N. Coping with new user problems: Transfer
learning in accelerometer-based human activity recognition. NIPS 2010 Workshop
on Transfer Learning by Learning Rich Generative Models, Whistler, British
Columbia, Canada, Dec. 11, 2010.

25. Sugiyama, M., Kanamori, T., Suzuki, T., Hido, S., Sese, J., Takeuchi, I., & Wang,
L. A density-ratio framework for statistical data processing. IPSJ Transactions on
Computer Vision and Applications, vol.1, pp.183-208, 2009.

26. Sugiyama, M. A new approach to machine learning based on density ratios.
Proceedings of the Institute of Statistical Mathematics, vol.58, no.2, pp.141-155, 2010.

27. Sugiyama, M. & Kawanabe, M. Covariate Shift Adaptation: Towards Machine
Learning under Non-Stationary Environment, MIT Press, Cambridge, MA, USA,
to appear.

28. Sugiyama, M., Suzuki, T., & Kanamori, T. Density Ratio Estimation in Machine
Learning: A Versatile Tool for Statistical Data Processing, Cambridge University
Press, Cambridge, UK, in preparation.

6  Interactions 

On Jun. 25, 2009 (at the AOARD Roppongi office), Nov. 3, 2009 (at the ACML conference),
and Dec. 21, 2010 (at my office at Tokyo Tech), I had technical discussions with my
program manager, Dr. Hiroshi Motoda, and received detailed comments and suggestions.

Below  is  the  list  of  my  presentations  related  to  the  project. 

1.  Dec.  22,  2010.  Toshiba,  Kawasaki,  Japan. 

2.  Dec.  1,  2010.  Fujitsu,  Kawasaki,  Japan.

3.  Nov.  30,  2010.  GCOE  Symposium  at  Tokyo  Institute  of  Technology,  Tokyo,  Japan. 

4.  Nov.  25,  2010.  LEAP  Symposium  at  Tokyo  Institute  of  Technology,  Tokyo,  Japan. 

5.  Nov.  24,  2010.  Kyoto  University,  Kyoto,  Japan. 

6.  Oct.  12,  2010.  Sony,  Tokyo,  Japan. 

7.  Oct.  4,  2010.  Google,  Tokyo,  Japan. 

8.  Sep.  20,  2010.  Yahoo,  Barcelona,  Spain. 

9.  Sep.  14,  2010.  Weierstrass  Institute  for  Applied  Analysis  and  Stochastics,  Berlin, 
Germany. 

10.  Sep.  13,  2010.  Technical  University  Berlin,  Berlin,  Germany. 

11.  Aug.  3,  2010.  Aichi  Science  and  Technology  Foundation,  Nagoya,  Japan. 

12.  Aug.  2,  2010.  Nagoya  Institute  of  Technology,  Nagoya,  Japan. 

13.  Jul.  20,  2010.  NEC,  Kawasaki,  Japan. 

14.  Jul.  15,  2010.  Georgia  Institute  of  Technology,  Atlanta,  GA,  USA. 

15.  Jun.  14,  2010.  Institute  of  Electronics,  Information  and  Communication  Engineers, 
Tokyo,  Japan. 

16.  Jun.  7,  2010.  Institute  of  Systems,  Control  and  Information  Engineers,  Osaka, 
Japan. 

17.  May  25,  2010.  The  Society  of  Instrument  and  Control  Engineers,  Tokyo,  Japan. 

18.  Apr.  26,  2010.  Carnegie  Mellon  University,  Pittsburgh,  PA,  USA.

19.  Apr.  21,  2010.  NEC  Soft,  Tokyo,  Japan. 

20.  Mar.  8,  2010.  Kyoto  University,  Kyoto,  Japan. 

21.  Dec.  21,  2009.  Science  Council  of  Japan,  Tokyo,  Japan. 

22.  Dec.  3,  2009.  GCOE  Symposium  at  Tokyo  Institute  of  Technology,  Tokyo,  Japan. 

23.  Nov.  27,  2009.  National  Institute  of  Information  and  Communications  Technology, 
Kyoto,  Japan. 

24.  Nov.  17,  2009.  NEC  Soft,  Tokyo,  Japan.

25.  Nov.  3,  2009.  1st  Asian  Conference  on  Machine  Learning  (ACML2009),  Nanjing,
China.

26.  Oct.  16,  2009.  NLP  workshop,  Tokyo,  Japan. 

27.  Jun.  25,  2009.  Asian  Office  of  Aerospace  Research  &  Development,  Tokyo,  Japan.

28.  May  21,  2009.  NTT  Communication  Science  Laboratories,  Kanagawa,  Japan.

7  Inventions 


None. 


8  Honors  /  Award 

I received two awards related to the current project.

1. May 13, 2010. Incentive Award, The Institute of Electronics, Information and
Communication Engineers (IEICE), PRMU Technical Group.

2. Jun. 10, 2010. Incentive Award, Japanese Society for Artificial Intelligence (JSAI),
SIG-DMSM.

Award 1 was given for a conference version of publication 20, while award 2 was given
for a conference version of publication 13.

9  Archival  Documentation 

Selected journal articles (1, 5, 6, 11, 13, 17, 20, and 22) are attached as archival
documentation. All the publications listed in Section 5 are available from
http://sugiyama-www.cs.titech.ac.jp/~sugi/publications.html.


10  Software 

Implementations of various density-ratio methods (mostly in MATLAB) are available from
my web page: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/index.html.

Abstract

We  address  the  problem  of  estimating  the  ratio  of  two  probability  density  functions,  which  is  often 
referred  to  as  the  importance.  The  importance  values  can  be  used  for  various  succeeding  tasks 
such  as  covariate  shift  adaptation  or  outlier  detection.  In  this  paper,  we  propose  a  new  importance 
estimation  method  that  has  a  closed-form  solution;  the  leave-one-out  cross-validation  score  can  also 
be  computed  analytically.  Therefore,  the  proposed  method  is  computationally  highly  efficient  and 
simple  to  implement.  We  also  elucidate  theoretical  properties  of  the  proposed  method  such  as  the 
convergence  rate  and  approximation  error  bounds.  Numerical  experiments  show  that  the  proposed 
method  is  comparable  to  the  best  existing  method  in  accuracy,  while  it  is  computationally  more 
efficient  than  competing  approaches. 

Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization
path, leave-one-out cross validation


1.  Introduction 

In the context of importance sampling (Fishman, 1996), the ratio of two probability density
functions is called the importance. The problem of estimating the importance is attracting a
great deal of attention these days since the importance can be used for various succeeding
tasks such as covariate shift adaptation or outlier detection.

*. A MATLAB® or R implementation of the proposed importance estimation algorithm,
unconstrained Least-Squares Importance Fitting (uLSIF), is available from
http://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/.

Covariate Shift Adaptation: Covariate shift is a situation in supervised learning where
the distributions of inputs change between the training and test phases but the
conditional distribution of outputs given inputs remains unchanged (Shimodaira, 2000;
Quiñonero-Candela et al., 2008). Covariate shift is conceivable in many real-world
applications such as bioinformatics (Baldi and Brunak, 1998; Borgwardt et al., 2006),
brain-computer interfaces (Wolpaw et al., 2002; Sugiyama et al., 2007), robot control
(Sutton and Barto, 1998; Hachiya et al., 2008), spam filtering (Bickel and Scheffer,
2007), and econometrics (Heckman, 1979). Under covariate shift, standard learning
techniques such as maximum likelihood estimation or cross-validation are biased and
therefore unreliable, but the bias caused by covariate shift can be compensated for by
weighting the loss function according to the importance (Shimodaira, 2000; Zadrozny, 2004;
Sugiyama and Müller, 2005; Sugiyama et al., 2007; Huang et al., 2007; Bickel et al.,
2007).

Outlier Detection: The outlier detection task addressed here is to identify irregular
samples in a validation data set based on a model data set that only contains regular
samples (Schölkopf et al., 2001; Tax and Duin, 2004; Hodge and Austin, 2004; Hido
et al., 2008). The values of the importance for regular samples are close to one, while
those for outliers tend to deviate significantly from one. Thus the values of the
importance could be used as an index of the degree of outlyingness.
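
As a rough illustration of using the importance as an outlyingness index, the sketch below evaluates an exactly known density ratio at a few points. Everything in it is an assumption made for the example: the model density is N(0, 1), the validation density is a mixture with 5% contamination around x = 5, and the ratio is taken as model density over validation density.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def p_model(x):
    """Assumed density of the model data set (regular samples only)."""
    return gauss_pdf(x, 0.0, 1.0)

def p_valid(x):
    """Assumed density of the validation data set: 95% regular, 5% outliers."""
    return 0.95 * gauss_pdf(x, 0.0, 1.0) + 0.05 * gauss_pdf(x, 5.0, 0.5)

def importance(x):
    """Ratio of model density to validation density (assumed direction)."""
    return p_model(x) / p_valid(x)

points = np.array([0.0, 1.0, 5.0])  # two regular points and one outlier
scores = importance(points)
for x, s in zip(points, scores):
    print(f"x = {x:4.1f}  importance = {s:.5f}")
```

The two regular points receive values close to one, while the point inside the contamination component receives a value near zero, so thresholding the importance separates them in this toy setting.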

Below,  we  refer  to  the  two  sets  of  samples  as  the  training  set  and  the  test  set. 

A naive approach to estimating the importance is to first estimate the training and test density
functions from the sets of training and test samples separately, and then take the ratio of the
estimated densities. However, density estimation is known to be a hard problem particularly in
high-dimensional cases if we do not have simple and good parametric density models (Vapnik, 1998;
Härdle et al., 2004). In practice, such an appropriate parametric model may not be available and
therefore this naive approach is not so effective.

To cope with this problem, direct importance estimation methods which do not involve
density estimation have been developed recently. The kernel mean matching (KMM) method (Huang
et al., 2007) directly gives estimates of the importance at the training inputs by matching the two
distributions efficiently based on a special property of universal reproducing kernel Hilbert spaces
(Steinwart, 2001). The optimization problem involved in KMM is a convex quadratic program, so
the unique global optimal solution can be obtained using standard optimization software.
However, the performance of KMM depends on the choice of tuning parameters such as the kernel
parameter and the regularization parameter. For the kernel parameter, a popular heuristic of using the
median distance between samples as the Gaussian width could be useful in some cases (Schölkopf
and Smola, 2002; Song et al., 2007). However, there seems to be no strong justification for this
heuristic and the choice of other tuning parameters is still open.

A probabilistic classifier that separates training samples from test samples can be used for
directly estimating the importance, for example, a logistic regression (LogReg) classifier (Qin, 1998;
Cheng and Chu, 2004; Bickel et al., 2007). Maximum likelihood estimation of LogReg models
can be formulated as a convex optimization problem, so the unique global optimal solution can be
obtained. Furthermore, since the LogReg-based method only involves a standard supervised
classification problem, the tuning parameters such as the kernel width and the regularization parameter
can be optimized based on the standard cross-validation procedure. This is a very useful property
in practice.
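
The classifier-based route can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the cited authors' implementation: it uses a plain linear logistic model fitted by Newton's method (the helper `fit_logreg` is hypothetical), one-dimensional Gaussian toy data, and Bayes' rule to convert class posteriors into a density ratio, w(x) = (n_tr / n_te) * P(test | x) / P(train | x).

```python
import numpy as np

def fit_logreg(X, y, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method; X must include a bias column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))          # P(label = 1 | x)
        grad = X.T @ (p - y) + ridge * theta
        hess = (X.T * (p * (1.0 - p))) @ X + ridge * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(0)
n_tr, n_te = 1500, 1000
x_tr = rng.normal(0.0, 1.0, size=n_tr)   # assumed training inputs ~ N(0, 1)
x_te = rng.normal(1.0, 1.0, size=n_te)   # assumed test inputs ~ N(1, 1)

# Label training samples 0 and test samples 1, then fit the classifier.
x_all = np.concatenate([x_tr, x_te])
labels = np.concatenate([np.zeros(n_tr), np.ones(n_te)])
X = np.column_stack([np.ones_like(x_all), x_all])
theta = fit_logreg(X, labels)

def importance(x):
    """Posterior odds rescaled by the class-size ratio (Bayes' rule)."""
    return (n_tr / n_te) * np.exp(theta[0] + theta[1] * x)

grid = np.array([-2.0, 0.0, 2.0])
w_grid = importance(grid)
print("estimated importance:", w_grid)
print("true importance:     ", np.exp(grid - 0.5))
```

For these two Gaussians the true log-ratio is linear in x, so the linear logistic model is well specified; with other densities a richer (for example kernel) model and cross-validated tuning parameters would be needed, as the text describes.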

The Kullback-Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008b; Nguyen
et al., 2008) also directly gives an estimate of the importance function by matching the two
distributions in terms of the Kullback-Leibler divergence (Kullback and Leibler, 1951). The optimization
problem involved in KLIEP is convex, so the unique global optimal solution, which tends to be
sparse, can be obtained when linear importance models are used. In addition, the tuning
parameters in KLIEP can be optimized based on a variant of cross-validation.

As reviewed above, LogReg and KLIEP are more advantageous than KMM since the tuning
parameters can be objectively optimized based on cross-validation. However, the optimization
procedures of LogReg and KLIEP are less efficient in computation than KMM due to the high
non-linearity of the objective functions to be optimized, more specifically, the exponential
functions induced by the LogReg model or the log function induced by the Kullback-Leibler
divergence. The purpose of this paper is to develop a new importance estimation method that is
equipped with a built-in model selection procedure, like LogReg and KLIEP, and is
computationally more efficient than LogReg and KLIEP.

Our basic idea is to formulate the direct importance estimation problem as a least-squares
function fitting problem. This formulation allows us to cast the optimization problem as a
convex quadratic program, which can be efficiently solved using a standard quadratic program solver.
Cross-validation can be used for optimizing the tuning parameters such as the kernel width or the
regularization parameter. We call the proposed method least-squares importance fitting (LSIF).
We further show that the solution of LSIF is piecewise linear with respect to the $\ell_1$-regularization
parameter and the entire regularization path (that is, all solutions for different regularization
parameter values) can be computed efficiently based on the parametric optimization technique (Best,
1982; Efron et al., 2004; Hastie et al., 2004). Thanks to this regularization path tracking algorithm,
LSIF is computationally efficient in model selection scenarios. Note that in the regularization path
tracking algorithm, we can trace the solution path without a quadratic program solver; we just need
to compute matrix inverses.

LSIF is shown to be efficient in computation, but it tends to share a common weakness of
regularization path tracking algorithms, that is, accumulation of numerical errors (Scheinberg, 2006).
The numerical problem tends to be severe if there are many change points in the regularization
path. To cope with this problem, we develop an approximation algorithm in the same least-squares
framework. The approximation version of LSIF, which we call unconstrained LSIF (uLSIF), allows
us to obtain the closed-form solution that can be computed just by solving a system of linear
equations. Thus uLSIF is numerically stable when regularized properly. Moreover, the leave-one-out
cross-validation score for uLSIF can also be computed analytically (cf. Wahba, 1990; Cawley and
Talbot, 2004), which significantly improves the computational efficiency in model selection
scenarios. We experimentally show that the accuracy of uLSIF is comparable to the best existing method
while its computation is faster than other methods in covariate shift adaptation and outlier detection
scenarios.
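
To make the "system of linear equations" concrete, here is a minimal sketch of a uLSIF-style estimator. It rests on assumptions that this part of the text does not spell out: a ridge-regularized least-squares objective whose minimizer is obtained from one linear solve, negative coefficients clipped to zero afterwards, Gaussian kernel basis functions centered on a subset of test points, and hand-picked values for the kernel width `sigma` and regularization parameter `lam` (no cross-validation is performed). The function names are hypothetical.

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    """Matrix of Gaussian kernel basis values for 1-D inputs (rows: samples)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def ulsif_fit(x_tr, x_te, centers, sigma, lam):
    """Closed-form coefficients: solve (H + lam*I) alpha = h, then clip at zero."""
    phi_tr = gaussian_design(x_tr, centers, sigma)
    phi_te = gaussian_design(x_te, centers, sigma)
    H_hat = phi_tr.T @ phi_tr / x_tr.size   # average of basis products, training
    h_hat = phi_te.mean(axis=0)             # average of basis values, test
    alpha = np.linalg.solve(H_hat + lam * np.eye(centers.size), h_hat)
    return np.maximum(alpha, 0.0)           # enforce non-negative coefficients

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=500)   # assumed training inputs ~ N(0, 1)
x_te = rng.normal(0.5, 0.5, size=500)   # assumed test inputs ~ N(0.5, 0.5^2)
centers = x_te[:50]                     # assumed choice of kernel centers
sigma, lam = 0.5, 0.1                   # fixed by hand for this illustration

alpha = ulsif_fit(x_tr, x_te, centers, sigma, lam)

def w_hat(x):
    """Estimated importance at the points x."""
    return gaussian_design(x, centers, sigma) @ alpha

w_vals = w_hat(np.array([-2.0, 0.5]))
print("estimated importance at x = -2.0 and x = 0.5:", w_vals)
```

The only heavy operation is one `np.linalg.solve` call on a matrix whose size equals the number of basis functions, which is why this kind of estimator scales well when the number of basis functions is kept moderate.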

Our contributions in this paper are summarized as follows. The proposed density-ratio
estimation method, LSIF, is equipped with cross-validation (which is an advantage over KMM) and is
computationally efficient thanks to regularization path tracking (which is an advantage over KLIEP
and LogReg). Furthermore, uLSIF is computationally even more efficient since its solution and
leave-one-out cross-validation score can be computed analytically in a stable manner. The proposed
methods, LSIF and uLSIF, are similar in spirit to KLIEP, but the loss functions are different: KLIEP
uses the log loss while LSIF and uLSIF use the squared loss. The difference of the loss functions
allows us to improve computational efficiency significantly.

The rest of this paper is organized as follows. In Section 2, we propose a new importance
estimation procedure based on least-squares fitting (LSIF) and show its theoretical properties. In
Section 3, we develop an approximation algorithm (uLSIF) which can be computed efficiently. In
Section 4, we illustrate how the proposed methods behave using a toy data set. In Section 5, we
discuss the characteristics of existing approaches in comparison with the proposed methods and show
that uLSIF could be a useful alternative to the existing methods. In Section 6, we experimentally
compare the performance of uLSIF and existing methods. Finally in Section 7, we summarize our
contributions and outline future prospects. Those who are interested in practical implementation
may skip the theoretical analyses in Sections 2.3, 3.2, and 3.3.


2.  Direct  Importance  Estimation 

In  this  section,  we  propose  a  new  method  of  direct  importance  estimation. 


2.1  Formulation  and  Notation 

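Let $\mathcal{D}$ ($\subset \mathbb{R}^d$) be the data domain and suppose we are given independent and
identically distributed (i.i.d.) training samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ from a training
distribution with density $p_{\mathrm{tr}}(x)$ and i.i.d. test samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$
from a test distribution with density $p_{\mathrm{te}}(x)$:

$$\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{tr}}(x), \qquad \{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{te}}(x).$$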

We assume that the training density is strictly positive, that is,

$$p_{\mathrm{tr}}(x) > 0 \quad \text{for all } x \in \mathcal{D}.$$

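The goal of this paper is to estimate the importance $w(x)$ from $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and
$\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$:

$$w(x) = \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)},$$

which is non-negative by definition. Our key restriction is that we want to avoid estimating the
densities $p_{\mathrm{te}}(x)$ and $p_{\mathrm{tr}}(x)$ when estimating the importance $w(x)$.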
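
As a small worked example of this definition (the two densities are an illustrative assumption, not a setting taken from the paper), let the training density be the standard normal and the test density a unit-variance normal centered at one. The ratio then has a simple closed form:

```latex
\begin{aligned}
p_{\mathrm{tr}}(x) &= \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right), \qquad
p_{\mathrm{te}}(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-1)^2}{2}\right), \\
w(x) &= \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}
      = \exp\!\left(\frac{x^2 - (x-1)^2}{2}\right)
      = \exp\!\left(x - \tfrac{1}{2}\right).
\end{aligned}
```

Here $w(x)$ is strictly positive, exceeds one exactly where test inputs are more likely than training inputs ($x > 1/2$), and integrates to one against the training density, since $\int w(x)\, p_{\mathrm{tr}}(x)\, dx = \int p_{\mathrm{te}}(x)\, dx = 1$.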


2.2  Least-squares  Approach  to  Direct  Importance  Estimation 

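Let us model the importance $w(x)$ by the following linear model:

$$\widehat{w}(x) = \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x), \tag{1}$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_b)^\top$ are parameters to be learned from data samples, $^\top$
denotes the transpose of a matrix or a vector, and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are basis functions such that

$$\varphi_\ell(x) \ge 0 \quad \text{for all } x \in \mathcal{D} \text{ and for } \ell = 1, 2, \ldots, b.$$

Note that $b$ and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ could be dependent on the samples
$\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$; for example,
kernel models are also allowed. We explain how the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are chosen in
Section 2.5.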

We determine the parameters $\{\alpha_\ell\}_{\ell=1}^{b}$ in the model $\widehat{w}(x)$ so that the following squared
error $J_0$ is minimized:

$$\begin{aligned}
J_0(\alpha) &= \frac{1}{2} \int \left(\widehat{w}(x) - w(x)\right)^2 p_{\mathrm{tr}}(x)\,dx \\
&= \frac{1}{2} \int \widehat{w}(x)^2 p_{\mathrm{tr}}(x)\,dx - \int \widehat{w}(x) w(x) p_{\mathrm{tr}}(x)\,dx + \frac{1}{2} \int w(x)^2 p_{\mathrm{tr}}(x)\,dx \\
&= \frac{1}{2} \int \widehat{w}(x)^2 p_{\mathrm{tr}}(x)\,dx - \int \widehat{w}(x) p_{\mathrm{te}}(x)\,dx + \frac{1}{2} \int w(x)^2 p_{\mathrm{tr}}(x)\,dx,
\end{aligned}$$

where in the second term the probability density $p_{\mathrm{tr}}(x)$ is canceled with that included in $w(x)$.
The squared loss $J_0(\alpha)$ is defined as the expectation under the probability of training samples. In
covariate shift adaptation (see Section 6.2) and outlier detection (see Section 6.3), the importance
values on the training samples are used. Thus, the definition of $J_0(\alpha)$ agrees well with our goal.

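The last term of $J_0(\alpha)$ is a constant and therefore can be safely ignored. Let us denote the first
two terms by $J$:

$$\begin{aligned}
J(\alpha) &= \frac{1}{2} \int \widehat{w}(x)^2 p_{\mathrm{tr}}(x)\,dx - \int \widehat{w}(x) p_{\mathrm{te}}(x)\,dx \\
&= \frac{1}{2} \sum_{\ell,\ell'=1}^{b} \alpha_\ell \alpha_{\ell'} \left( \int \varphi_\ell(x) \varphi_{\ell'}(x) p_{\mathrm{tr}}(x)\,dx \right) - \sum_{\ell=1}^{b} \alpha_\ell \left( \int \varphi_\ell(x) p_{\mathrm{te}}(x)\,dx \right) \\
&= \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,
\end{aligned} \tag{2}$$

where $H$ is the $b \times b$ matrix with the $(\ell,\ell')$-th element

$$H_{\ell,\ell'} = \int \varphi_\ell(x) \varphi_{\ell'}(x) p_{\mathrm{tr}}(x)\,dx, \tag{3}$$

and $h$ is the $b$-dimensional vector with the $\ell$-th element

$$h_\ell = \int \varphi_\ell(x) p_{\mathrm{te}}(x)\,dx.$$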

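Approximating the expectations in $J$ by empirical averages, we obtain

$$\begin{aligned}
\widehat{J}(\alpha) &= \frac{1}{2 n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \widehat{w}(x_i^{\mathrm{tr}})^2 - \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \widehat{w}(x_j^{\mathrm{te}}) \\
&= \frac{1}{2} \sum_{\ell,\ell'=1}^{b} \alpha_\ell \alpha_{\ell'} \left( \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \varphi_\ell(x_i^{\mathrm{tr}}) \varphi_{\ell'}(x_i^{\mathrm{tr}}) \right) - \sum_{\ell=1}^{b} \alpha_\ell \left( \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}}) \right) \\
&= \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha,
\end{aligned}$$

where $\widehat{H}$ is the $b \times b$ matrix with the $(\ell,\ell')$-th element

$$\widehat{H}_{\ell,\ell'} = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \varphi_\ell(x_i^{\mathrm{tr}}) \varphi_{\ell'}(x_i^{\mathrm{tr}}), \tag{4}$$

and $\widehat{h}$ is the $b$-dimensional vector with the $\ell$-th element

$$\widehat{h}_\ell = \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}}). \tag{5}$$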
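As an illustration of the empirical quantities above, the following minimal Python sketch computes the empirical matrix and vector of Eqs. (4) and (5) for scalar samples and checks the quadratic form of the empirical cost. The function names, the Gaussian basis with two centers, and the sample values are assumptions made for this example only; they are not part of the original implementation.

```python
import math


def gaussian_basis(centers, sigma):
    """Return phi(x) -> list of Gaussian basis values, one per center (scalar x)."""
    def phi(x):
        return [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
    return phi


def empirical_H_h(x_tr, x_te, phi):
    """Empirical averages corresponding to Eqs. (4) and (5)."""
    feats_tr = [phi(x) for x in x_tr]
    feats_te = [phi(x) for x in x_te]
    b = len(feats_tr[0])
    H = [[sum(f[l] * f[m] for f in feats_tr) / len(x_tr) for m in range(b)]
         for l in range(b)]
    h = [sum(f[l] for f in feats_te) / len(x_te) for l in range(b)]
    return H, h


def empirical_cost(alpha, H, h):
    """J_hat(alpha) = 0.5 * alpha^T H alpha - h^T alpha."""
    b = len(alpha)
    quad = sum(alpha[l] * H[l][m] * alpha[m] for l in range(b) for m in range(b))
    return 0.5 * quad - sum(h[l] * alpha[l] for l in range(b))


# Hypothetical toy data, chosen only to exercise the functions.
x_tr = [0.2, 0.9, 1.1, 1.6]
x_te = [1.8, 2.0, 2.3]
phi = gaussian_basis(centers=[1.0, 2.0], sigma=0.5)
H, h = empirical_H_h(x_tr, x_te, phi)
print(empirical_cost([0.3, 1.2], H, h))
```

The quadratic form and the direct sample averages of the model output give the same value, which is exactly the identity used in the derivation above.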

Taking into account the non-negativity of the importance function $w(x)$, we can formulate our
optimization problem as follows.

$$\min_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \lambda 1_b^\top \alpha \right] \quad \text{subject to } \alpha \ge 0_b, \tag{6}$$

where $0_b$ and $1_b$ are the $b$-dimensional vectors with all zeros and ones, respectively; the vector
inequality $\alpha \ge 0_b$ is applied in the element-wise manner, that is, $\alpha_\ell \ge 0$ for $\ell = 1, 2, \ldots, b$. In Eq. (6),
we included a penalty term $\lambda 1_b^\top \alpha$ for regularization purposes, where $\lambda$ $(\ge 0)$ is a regularization
parameter. The above is a convex quadratic programming problem and therefore the unique global
optimal solution can be computed efficiently by a standard optimization package. We call this
method Least-Squares Importance Fitting (LSIF).
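The structure of problem (6) can be seen in a few lines of code. The sketch below uses projected gradient descent purely as a dependency-free stand-in for the standard optimization package mentioned above; it is not the solver used in this paper. The function name, the iteration count, and the 2 x 2 example values are assumptions made for illustration.

```python
def lsif_projected_gradient(H, h, lam, n_iter=2000):
    """Minimise 0.5*a^T H a - h^T a + lam*sum(a) subject to a >= 0.

    Projected gradient descent is a simple substitute for a QP solver here.
    """
    b = len(h)
    # trace(H) upper-bounds the largest eigenvalue of the positive
    # semi-definite matrix H, so 1/trace(H) is a safe step size.
    step = 1.0 / sum(H[l][l] for l in range(b))
    alpha = [0.0] * b
    for _ in range(n_iter):
        grad = [sum(H[l][m] * alpha[m] for m in range(b)) - h[l] + lam
                for l in range(b)]
        # Gradient step followed by projection onto the non-negative orthant.
        alpha = [max(0.0, alpha[l] - step * grad[l]) for l in range(b)]
    return alpha


# Hypothetical 2x2 problem, not taken from the paper's experiments.
H = [[1.0, 0.5], [0.5, 1.0]]
h = [1.0, 0.2]
alpha = lsif_projected_gradient(H, h, lam=0.1)
print(alpha)
```

In this example the second coefficient is driven exactly to zero by the constraint, which shows the sparsity that the linear penalty together with non-negativity tends to produce.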

We can also use the $\ell_2$-regularizer $\alpha^\top \alpha$ instead of the $\ell_1$-regularizer $1_b^\top \alpha$ without changing the
computational property. However, using the $\ell_1$-regularizer would be more advantageous since the
solution tends to be sparse (Williams, 1995; Tibshirani, 1996; Chen et al., 1998). Furthermore, as
shown in Section 2.6, the use of the $\ell_1$-regularizer allows us to compute the entire regularization
path efficiently (Best, 1982; Efron et al., 2004; Hastie et al., 2004). The $\ell_2$-regularization method
will be used for theoretical analysis in Section 3.3.

2.3  Convergence  Analysis  of  LSIF 

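Here, we theoretically analyze the convergence property of the solution $\widehat{\alpha}$ of the LSIF algorithm;
practitioners may skip this theoretical analysis.

Let $\widehat{\alpha}(\lambda)$ be the solution of the LSIF algorithm with regularization parameter $\lambda$, and let $\alpha^*(\lambda)$
be the optimal solution of the 'ideal' problem:

$$\min_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha + \lambda 1_b^\top \alpha \right] \quad \text{subject to } \alpha \ge 0_b. \tag{7}$$

Below, we theoretically investigate the learning curve (Amari et al., 1992) of LSIF, that is, we
elucidate the relation between $J(\widehat{\alpha}(\lambda))$ and $J(\alpha^*(\lambda))$ in terms of the expectation over all possible
training and test samples as a function of the number of samples.

Let $\mathbb{E}$ be the expectation over all possible training samples of size $n_{\mathrm{tr}}$ and all possible test
samples of size $n_{\mathrm{te}}$. Let $\mathcal{A} \subset \{1, 2, \ldots, b\}$ be the set of active indices (Boyd and Vandenberghe, 2004),
that is,

$$\mathcal{A} = \{\ell \mid \alpha^*_\ell(\lambda) = 0,\ \ell = 1, 2, \ldots, b\}.$$

For the active set $\mathcal{A} = \{j_1, j_2, \ldots, j_{|\mathcal{A}|}\}$ with $j_1 < j_2 < \cdots < j_{|\mathcal{A}|}$, let $E$ be the $|\mathcal{A}| \times b$ indicator matrix
with the $(i,j)$-th element

$$E_{i,j} = \begin{cases} 1 & j = j_i, \\ 0 & \text{otherwise}. \end{cases}$$

Similarly, let $\widehat{\mathcal{A}}$ be the active set of $\widehat{\alpha}(\lambda)$:

$$\widehat{\mathcal{A}} = \{\ell \mid \widehat{\alpha}_\ell(\lambda) = 0,\ \ell = 1, 2, \ldots, b\}.$$

For the active set $\widehat{\mathcal{A}} = \{\hat{j}_1, \hat{j}_2, \ldots, \hat{j}_{|\widehat{\mathcal{A}}|}\}$ with $\hat{j}_1 < \hat{j}_2 < \cdots < \hat{j}_{|\widehat{\mathcal{A}}|}$, let $\widehat{E}$ be the $|\widehat{\mathcal{A}}| \times b$ indicator matrix
with the $(i,j)$-th element similarly defined by

$$\widehat{E}_{i,j} = \begin{cases} 1 & j = \hat{j}_i, \\ 0 & \text{otherwise}. \end{cases} \tag{8}$$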

First, we show the optimality condition of (6) which will be used in the following theoretical
analyses. The Lagrangian of the optimization problem (6) is given as

$$L(\alpha, \xi) = \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \lambda 1_b^\top \alpha - \xi^\top \alpha,$$

where $\xi$ is the $b$-dimensional Lagrange multiplier vector. Then the Karush-Kuhn-Tucker (KKT)
conditions (Boyd and Vandenberghe, 2004) are expressed as follows:

$$\widehat{H} \alpha - \widehat{h} + \lambda 1_b - \xi = 0_b, \tag{9}$$

$$\alpha \ge 0_b, \qquad \xi \ge 0_b,$$

$$\xi_\ell \alpha_\ell = 0 \quad \text{for } \ell = 1, 2, \ldots, b. \tag{10}$$


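Let $\widehat{\xi}'(\lambda)$ be the $|\widehat{\mathcal{A}}|$-dimensional vector with the $i$-th element being the $\hat{j}_i$-th element of $\widehat{\xi}(\lambda)$:

$$\widehat{\xi}'_i(\lambda) = \widehat{\xi}_{\hat{j}_i}(\lambda), \quad i = 1, \ldots, |\widehat{\mathcal{A}}|. \tag{11}$$

We assume that $\widehat{\xi}'(\lambda)$ only contains non-zero elements of $\widehat{\xi}(\lambda)$. Let $\widehat{G}$ be

$$\widehat{G} = \begin{pmatrix} \widehat{H} & -\widehat{E}^\top \\ -\widehat{E} & O_{|\widehat{\mathcal{A}}| \times |\widehat{\mathcal{A}}|} \end{pmatrix},$$

where $O_{|\widehat{\mathcal{A}}| \times |\widehat{\mathcal{A}}|}$ is the $|\widehat{\mathcal{A}}| \times |\widehat{\mathcal{A}}|$ matrix with all zeros. Then Eqs. (9) and (10) are together expressed
in a matrix form as

$$\widehat{G} \begin{pmatrix} \widehat{\alpha}(\lambda) \\ \widehat{\xi}'(\lambda) \end{pmatrix} = \begin{pmatrix} \widehat{h} - \lambda 1_b \\ 0_{|\widehat{\mathcal{A}}|} \end{pmatrix}. \tag{12}$$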


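Regarding the matrix $\widehat{G}$, we have the following lemma:

Lemma 1 The matrix $\widehat{G}$ is invertible if $\widehat{H}$ is invertible.

The proof of the above lemma is given in Appendix A. Below, we assume that $\widehat{H}$ is invertible.
Then the inverse of $\widehat{G}$ exists and multiplying $\widehat{G}^{-1}$ from the left-hand side of Eq. (12) yields

$$\begin{pmatrix} \widehat{\alpha}(\lambda) \\ \widehat{\xi}'(\lambda) \end{pmatrix} = \widehat{G}^{-1} \begin{pmatrix} \widehat{h} - \lambda 1_b \\ 0_{|\widehat{\mathcal{A}}|} \end{pmatrix}. \tag{13}$$

The following inversion formula holds for block matrices (Petersen and Pedersen, 2007):

$$\begin{pmatrix} M_1 & M_2 \\ M_3 & M_4 \end{pmatrix}^{-1} = \begin{pmatrix} M_1^{-1} + M_1^{-1} M_2 M_0^{-1} M_3 M_1^{-1} & -M_1^{-1} M_2 M_0^{-1} \\ -M_0^{-1} M_3 M_1^{-1} & M_0^{-1} \end{pmatrix}, \tag{14}$$

where

$$M_0 = M_4 - M_3 M_1^{-1} M_2.$$

Applying Eq. (14) to Eq. (13), we have

$$\widehat{\alpha}(\lambda) = \widehat{A}(\widehat{h} - \lambda 1_b), \tag{15}$$

where $\widehat{A}$ is defined by

$$\widehat{A} = \widehat{H}^{-1} - \widehat{H}^{-1} \widehat{E}^\top (\widehat{E} \widehat{H}^{-1} \widehat{E}^\top)^{-1} \widehat{E} \widehat{H}^{-1}. \tag{16}$$

When the Lagrange multiplier vector satisfies

$$\xi^*_\ell(\lambda) > 0 \quad \text{for all } \ell \in \mathcal{A}, \tag{17}$$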

we  say  that  the  strict  complementarity  condition  is  satisfied  (Bertsekas  et  al.,  2003).  An  important 
consequence  of  strict  complementarity  is  that  the  optimal  solution  and  the  Lagrange  multipliers  of 
convex  quadratic  problems  are  uniquely  determined.  Then  we  have  the  following  theorem. 


Theorem 2 Let $P$ be the probability over all possible training samples of size $n_{\mathrm{tr}}$ and test samples of
size $n_{\mathrm{te}}$. Let $\xi^*(\lambda)$ be the Lagrange multiplier vector of the problem (7) and suppose $\xi^*(\lambda)$ satisfies
the strict complementarity condition (17). Then, there exists a positive constant $c > 0$ and a natural
number $N$ such that for $\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\} \ge N$,

$$P(\widehat{\mathcal{A}} \ne \mathcal{A}) < e^{-c \min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}}.$$


The proof of the above theorem is given in Appendix B. Theorem 2 shows that the probability
that the active set $\widehat{\mathcal{A}}$ of the empirical problem (6) is different from the active set $\mathcal{A}$ of the ideal
problem (7) is exponentially small. Thus we may regard $\widehat{\mathcal{A}} = \mathcal{A}$ in practice.
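To make the closed form of Eqs. (15) and (16) concrete, the following worked example evaluates it on a hypothetical two-basis problem. The numbers are chosen only for illustration and are not taken from the experiments in this paper; the active set is assumed to be known.

```latex
\[
\widehat{H} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad
\widehat{h} = \begin{pmatrix} 1 \\ 0.2 \end{pmatrix}, \quad
\lambda = 0.1, \quad
\widehat{\mathcal{A}} = \{2\}, \quad
\widehat{E} = \begin{pmatrix} 0 & 1 \end{pmatrix}.
\]
\[
\widehat{H}^{-1} = \frac{1}{3}\begin{pmatrix} 4 & -2 \\ -2 & 4 \end{pmatrix}, \qquad
\widehat{E}\widehat{H}^{-1}\widehat{E}^\top = \frac{4}{3}, \qquad
\widehat{H}^{-1}\widehat{E}^\top(\widehat{E}\widehat{H}^{-1}\widehat{E}^\top)^{-1}\widehat{E}\widehat{H}^{-1}
= \frac{1}{3}\begin{pmatrix} 1 & -2 \\ -2 & 4 \end{pmatrix}.
\]
\[
\widehat{A} = \frac{1}{3}\begin{pmatrix} 4 & -2 \\ -2 & 4 \end{pmatrix}
- \frac{1}{3}\begin{pmatrix} 1 & -2 \\ -2 & 4 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
\widehat{\alpha}(\lambda) = \widehat{A}(\widehat{h} - \lambda 1_b)
= \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} 0.9 \\ 0.1 \end{pmatrix}
= \begin{pmatrix} 0.9 \\ 0 \end{pmatrix}.
\]
\[
\widehat{\xi}(\lambda) = \widehat{H}\widehat{\alpha}(\lambda) - \widehat{h} + \lambda 1_b
= \begin{pmatrix} 0.9 \\ 0.45 \end{pmatrix} - \begin{pmatrix} 1 \\ 0.2 \end{pmatrix} + \begin{pmatrix} 0.1 \\ 0.1 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0.35 \end{pmatrix}.
\]
```

In this instance the multiplier of the active constraint is strictly positive and the inactive coefficient is strictly positive, so the KKT conditions (9) and (10) hold together with strict complementarity.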

Let $A$ be the 'ideal' counterpart of $\widehat{A}$:

$$A = H^{-1} - H^{-1} E^\top (E H^{-1} E^\top)^{-1} E H^{-1},$$

and let $C_{w,w'}$ be the $b \times b$ covariance matrix with the $(\ell,\ell')$-th element being the covariance between
$w(x)\varphi_\ell(x)$ and $w'(x)\varphi_{\ell'}(x)$ under $p_{\mathrm{tr}}(x)$. Let

$$w^*(x) = \sum_{\ell=1}^{b} \alpha^*_\ell(\lambda) \varphi_\ell(x),$$

$$v(x) = \sum_{\ell=1}^{b} [A 1_b]_\ell \varphi_\ell(x).$$

Let

$$f(n) = \omega(g(n))$$

denote that $f(n)$ asymptotically dominates $g(n)$; more precisely, for all $C > 0$, there exists $n_0$ such
that

$$|C g(n)| < |f(n)| \quad \text{for all } n \ge n_0.$$

Then we have the following theorem.




Theorem 3 Assume that

(a) The optimal solution of the problem (7) satisfies the strict complementarity condition (17).

(b) $n_{\mathrm{tr}}$ and $n_{\mathrm{te}}$ satisfy

$$n_{\mathrm{te}} = \omega(n_{\mathrm{tr}}^2). \tag{18}$$

Then, for any $\lambda \ge 0$, we have

$$\mathbb{E}[J(\widehat{\alpha}(\lambda))] = J(\alpha^*(\lambda)) + \frac{1}{2 n_{\mathrm{tr}}} \mathrm{tr}\left(A (C_{w^*,w^*} - 2\lambda C_{w^*,v})\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{19}$$

The proof of the above theorem is given in Appendix C. This theorem elucidates the learning
curve of LSIF up to the order of $n_{\mathrm{tr}}^{-1}$. In Section 2.4.1, we discuss practical implications of this
theorem.

2.4  Model  Selection  for  LSIF 

The practical performance of LSIF depends on the choice of the regularization parameter $\lambda$ and
basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ (which we refer to as a model). Since our objective is to minimize the
cost function $J$ defined in Eq. (2), it is natural to determine the model such that $J$ is minimized.

However, the value of the cost function $J$ is inaccessible since it includes the expectation over
unknown probability density functions $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$. The value of the empirical cost $\widehat{J}$ may be
regarded as an estimate of $J$, but this is not useful for model selection purposes since it is heavily
biased; the bias is caused by the fact that the same samples are used twice for learning the parameter
$\alpha$ and estimating the value of $J$. Below, we give two practical methods of estimating the value of $J$
in more precise ways.

2.4.1  Information  Criterion 

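In the same way as Theorem 3, we can obtain an asymptotic expansion of the empirical cost
$\mathbb{E}[\widehat{J}(\widehat{\alpha}(\lambda))]$ as follows:

$$\mathbb{E}[\widehat{J}(\widehat{\alpha}(\lambda))] = J(\alpha^*(\lambda)) - \frac{1}{2 n_{\mathrm{tr}}} \mathrm{tr}\left(A (C_{w^*,w^*} + 2\lambda C_{w^*,v})\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{20}$$

Combining Eqs. (19) and (20), we have

$$\mathbb{E}[J(\widehat{\alpha}(\lambda))] = \mathbb{E}[\widehat{J}(\widehat{\alpha}(\lambda))] + \frac{1}{n_{\mathrm{tr}}} \mathrm{tr}(A C_{w^*,w^*}) + o\left(\frac{1}{n_{\mathrm{tr}}}\right).$$

From this, we can immediately obtain an information criterion (Akaike, 1974; Konishi and Kitagawa,
1996) for LSIF:

$$\widehat{J}^{(\mathrm{IC})} = \widehat{J}(\widehat{\alpha}(\lambda)) + \frac{1}{n_{\mathrm{tr}}} \mathrm{tr}(\widehat{A} \widehat{C}_{\widehat{w},\widehat{w}}),$$

where $\widehat{A}$ is defined by Eq. (16), $\widehat{E}$ is defined by Eq. (8), and $\widehat{C}_{w,w'}$ is the $b \times b$ covariance matrix with
the $(\ell,\ell')$-th element being the covariance between $w(x)\varphi_\ell(x)$ and $w'(x)\varphi_{\ell'}(x)$ over $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$. Since
$\widehat{A}$ and $\widehat{C}_{\widehat{w},\widehat{w}}$ are consistent estimators of $A$ and $C_{w^*,w^*}$, the above information criterion is unbiased up
to the order of $n_{\mathrm{tr}}^{-1}$.

Note that the term $\mathrm{tr}(\widehat{A} \widehat{C}_{\widehat{w},\widehat{w}})$ may be interpreted as the effective dimension of the model (Moody,
1992). Indeed, when $\widehat{w}(x) = 1$, we have $\widehat{H} = \widehat{C}_{\widehat{w},\widehat{w}}$, and thus

$$\mathrm{tr}(\widehat{A} \widehat{C}_{\widehat{w},\widehat{w}}) = \mathrm{tr}(I_b) - \mathrm{tr}\left(\widehat{E} \widehat{C}_{\widehat{w},\widehat{w}}^{-1} \widehat{E}^\top (\widehat{E} \widehat{C}_{\widehat{w},\widehat{w}}^{-1} \widehat{E}^\top)^{-1}\right) = b - |\widehat{\mathcal{A}}|,$$

which is the dimension of the face on which $\widehat{\alpha}(\lambda)$ lies.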


2.4.2  Cross-validation 


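Although the information criterion derived above is more accurate than just a naive empirical
estimator, its accuracy is guaranteed only asymptotically. Here, we employ cross-validation for estimating
$J(\widehat{\alpha})$, which has an accuracy guarantee for finite samples.

First, the training samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and test samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ are divided into $R$ disjoint subsets
$\{\mathcal{X}_r^{\mathrm{tr}}\}_{r=1}^{R}$ and $\{\mathcal{X}_r^{\mathrm{te}}\}_{r=1}^{R}$, respectively. Then an importance estimate $\widehat{w}_{\mathcal{X}_r^{\mathrm{tr}},\mathcal{X}_r^{\mathrm{te}}}(x)$ is obtained using
$\{\mathcal{X}_j^{\mathrm{tr}}\}_{j \ne r}$ and $\{\mathcal{X}_j^{\mathrm{te}}\}_{j \ne r}$ (that is, without $\mathcal{X}_r^{\mathrm{tr}}$ and $\mathcal{X}_r^{\mathrm{te}}$), and the cost $J$ is approximated using the
held-out samples $\mathcal{X}_r^{\mathrm{tr}}$ and $\mathcal{X}_r^{\mathrm{te}}$ as

$$\widehat{J}^{(\mathrm{CV})}_{\mathcal{X}_r^{\mathrm{tr}},\mathcal{X}_r^{\mathrm{te}}} = \frac{1}{2 |\mathcal{X}_r^{\mathrm{tr}}|} \sum_{x^{\mathrm{tr}} \in \mathcal{X}_r^{\mathrm{tr}}} \widehat{w}_{\mathcal{X}_r^{\mathrm{tr}},\mathcal{X}_r^{\mathrm{te}}}(x^{\mathrm{tr}})^2 - \frac{1}{|\mathcal{X}_r^{\mathrm{te}}|} \sum_{x^{\mathrm{te}} \in \mathcal{X}_r^{\mathrm{te}}} \widehat{w}_{\mathcal{X}_r^{\mathrm{tr}},\mathcal{X}_r^{\mathrm{te}}}(x^{\mathrm{te}}).$$

This procedure is repeated for $r = 1, 2, \ldots, R$ and its average $\widehat{J}^{(\mathrm{CV})}$ is used as an estimate of $J$:

$$\widehat{J}^{(\mathrm{CV})} = \frac{1}{R} \sum_{r=1}^{R} \widehat{J}^{(\mathrm{CV})}_{\mathcal{X}_r^{\mathrm{tr}},\mathcal{X}_r^{\mathrm{te}}}.$$

We can show that $\widehat{J}^{(\mathrm{CV})}$ gives an almost unbiased estimate of the true cost $J$, where the 'almost'-ness
comes from the fact that the number of samples is reduced in the cross-validation procedure due to
data splitting (Luntz and Brailovsky, 1969; Wahba, 1990; Schölkopf and Smola, 2002).

Cross-validation would be more accurate than the information criterion for finite samples. However,
it is computationally more expensive than the information criterion since the learning procedure
should be repeated $R$ times.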
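The cross-validation procedure above can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function name, the strided fold assignment, and the callable `fit` (which stands for any importance estimator returning a function of x) are assumptions, and every fold is assumed to be non-empty.

```python
def cross_validated_cost(x_tr, x_te, fit, n_folds):
    """R-fold estimate of J; fit(train_part, test_part) must return a callable w_hat(x)."""
    scores = []
    for r in range(n_folds):
        # Fold r is held out; striding yields disjoint subsets.
        held_tr, held_te = x_tr[r::n_folds], x_te[r::n_folds]
        rest_tr = [x for i, x in enumerate(x_tr) if i % n_folds != r]
        rest_te = [x for i, x in enumerate(x_te) if i % n_folds != r]
        w_hat = fit(rest_tr, rest_te)
        # Held-out version of the empirical cost J_hat.
        score = (sum(w_hat(x) ** 2 for x in held_tr) / (2.0 * len(held_tr))
                 - sum(w_hat(x) for x in held_te) / len(held_te))
        scores.append(score)
    return sum(scores) / n_folds


def constant_fit(train_part, test_part):
    """Placeholder estimator that ignores the data and predicts w_hat(x) = 1."""
    return lambda x: 1.0


# Hypothetical samples, used only to exercise the function.
x_tr = [0.1 * i for i in range(10)]
x_te = [1.0 + 0.1 * j for j in range(12)]
score = cross_validated_cost(x_tr, x_te, constant_fit, n_folds=3)
print(score)
```

With the constant placeholder estimator every fold scores 1/2 - 1, so the average is -0.5; a real estimator would be passed in place of `constant_fit`, and the model with the smallest score would be selected.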


2.5  Heuristics  of  Basis  Function  Design  for  LSIF 

A good model may be chosen by cross-validation or the information criterion, given that a family of
promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel
model centered at the test points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, that is,

$$\widehat{w}(x) = \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x, x_\ell^{\mathrm{te}}),$$

where $K_\sigma(x, x')$ is the Gaussian kernel with kernel width $\sigma$:

$$K_\sigma(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \tag{21}$$

The reason why we chose the test points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as the Gaussian centers, not the training points
$\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, is as follows (Sugiyama et al., 2008b). By definition, the importance $w(x)$ tends to take
large values if the training density $p_{\mathrm{tr}}(x)$ is small and the test density $p_{\mathrm{te}}(x)$ is large; conversely,
$w(x)$ tends to be small (that is, close to zero) if $p_{\mathrm{tr}}(x)$ is large and $p_{\mathrm{te}}(x)$ is small. When a function
is approximated by a Gaussian kernel model, many kernels may be needed in the region where the
output of the target function is large; on the other hand, only a small number of kernels would be
enough in the region where the output of the target function is close to zero. Following this heuristic,
we allocate many kernels at high test density regions, which can be achieved by setting the Gaussian
centers at the test points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$.

Alternatively, we may locate $(n_{\mathrm{tr}} + n_{\mathrm{te}})$ Gaussian kernels at both $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$. However,
in our preliminary experiments, this did not further improve the performance, but just slightly
increased the computational cost. When $n_{\mathrm{te}}$ is large, just using all the test points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as Gaussian
centers is already computationally rather demanding. To ease this problem, we propose in practice
using a subset of $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as Gaussian centers for computational efficiency, that is,

$$\widehat{w}(x) = \sum_{\ell=1}^{b} \alpha_\ell K_\sigma(x, c_\ell), \tag{22}$$

where $c_\ell$, $\ell = 1, 2, \ldots, b$, are template points randomly chosen from $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ without replacement
and $b$ $(\le n_{\mathrm{te}})$ is a prefixed number. In the rest of this paper, we usually fix the number of template
points at

$$b = \min(100, n_{\mathrm{te}}),$$

and optimize the kernel width $\sigma$ and the regularization parameter $\lambda$ by cross-validation with grid
search.
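The basis-design heuristic can be written down compactly. The following Python sketch, for scalar inputs, draws the template points from the test samples without replacement and builds the kernel model of Eq. (22). The function names, the fixed random seed, and the placeholder data are assumptions made for this illustration.

```python
import math
import random


def choose_centers(x_te, max_centers=100, seed=0):
    """Pick b = min(max_centers, n_te) template points from the test samples without replacement."""
    b = min(max_centers, len(x_te))
    return random.Random(seed).sample(x_te, b)


def gaussian_kernel(x, c, sigma):
    """K_sigma(x, c) = exp(-|x - c|^2 / (2 sigma^2)) for scalar inputs."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def importance_model(alpha, centers, sigma):
    """Return w_hat(x) = sum_l alpha_l * K_sigma(x, c_l), as in Eq. (22)."""
    def w_hat(x):
        return sum(a * gaussian_kernel(x, c, sigma) for a, c in zip(alpha, centers))
    return w_hat


# Placeholder test samples; in practice these would be the observed test points.
x_te = [2.0 + 0.01 * j for j in range(250)]
centers = choose_centers(x_te)
w_hat = importance_model([1.0] * len(centers), centers, sigma=0.3)
print(len(centers), w_hat(2.5))
```

Because the centers are drawn from the test samples, the kernels concentrate where the test density is high, which is where the importance is expected to be large.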

2.6  Entire  Regularization  Path  for  LSIF 

We can show that the LSIF solution $\widehat{\alpha}$ is piecewise linear with respect to the regularization parameter
$\lambda$ (see Appendix D). Therefore, the regularization path (that is, solutions for all $\lambda$) can be computed
efficiently based on the parametric optimization technique (Best, 1982; Efron et al., 2004; Hastie
et al., 2004).

A basic idea of regularization path tracking is to check violation of the KKT conditions (which
are necessary and sufficient for optimality of convex programs) when the regularization parameter
$\lambda$ is changed. The KKT conditions of LSIF are summarized in Section 2.3. The strict
complementarity condition (17) assures the uniqueness of the optimal solution for a fixed $\lambda$, and thus the
uniqueness of the regularization path. A pseudo code of the regularization path tracking algorithm
for LSIF is described in Figure 1; its detailed derivation is summarized in Appendix D. Thanks to
the regularization path algorithm, LSIF is computationally efficient in model selection scenarios.

The pseudo code implies that we no longer need a quadratic programming solver for obtaining
the solution of LSIF; just computing matrix inverses is enough. Furthermore, the regularization
path algorithm is computationally more efficient when the solution is sparse, that is, most of the
elements are zero, since the number of change points tends to be small for such sparse solutions.

Even though the regularization path tracking algorithm is computationally efficient, it tends to
be numerically unreliable, as we experimentally show in Section 4. This numerical instability is
caused by near singularity of the matrix $\widehat{G}$. When $\widehat{G}$ is nearly singular, it is not easy to accurately
obtain the solutions $u, v$ in Figure 1, and therefore the change point $\lambda_{\tau+1}$ cannot be accurately
computed. As a result, we cannot accurately update the active set of the inequality constraints and thus
the obtained solution $\widehat{\alpha}(\lambda)$ becomes unreliable; furthermore, such numerical error tends to be
accumulated through the path-tracking process. This instability issue seems to be a common pitfall of
solution path tracking algorithms in general (see Scheinberg, 2006).

Input: $\widehat{H}$ and $\widehat{h}$  % see Eqs. (4) and (5) for the definitions
Output: entire regularization path $\widehat{\alpha}(\lambda)$ for $\lambda \ge 0$

$\tau \leftarrow 0$;
$k \leftarrow \operatorname{argmax}_i \{\widehat{h}_i \mid i = 1, 2, \ldots, b\}$;
$\lambda_\tau \leftarrow \widehat{h}_k$;
$\widehat{\mathcal{A}} \leftarrow \{1, 2, \ldots, b\} \setminus \{k\}$;
$\widehat{\alpha}(\lambda_\tau) \leftarrow 0_b$;  % the vector with all zeros
While $\lambda_\tau > 0$
    $\widehat{E} \leftarrow O_{|\widehat{\mathcal{A}}| \times b}$;  % the matrix with all zeros
    For $i = 1, 2, \ldots, |\widehat{\mathcal{A}}|$
        $\widehat{E}_{i,\hat{j}_i} \leftarrow 1$;  % $\widehat{\mathcal{A}} = \{\hat{j}_1, \hat{j}_2, \ldots, \hat{j}_{|\widehat{\mathcal{A}}|} \mid \hat{j}_1 < \hat{j}_2 < \cdots < \hat{j}_{|\widehat{\mathcal{A}}|}\}$
    end
    $\widehat{G} \leftarrow \begin{pmatrix} \widehat{H} & -\widehat{E}^\top \\ -\widehat{E} & O_{|\widehat{\mathcal{A}}| \times |\widehat{\mathcal{A}}|} \end{pmatrix}$;
    $u \leftarrow \widehat{G}^{-1} \begin{pmatrix} \widehat{h} \\ 0_{|\widehat{\mathcal{A}}|} \end{pmatrix}$;
    $v \leftarrow \widehat{G}^{-1} \begin{pmatrix} 1_b \\ 0_{|\widehat{\mathcal{A}}|} \end{pmatrix}$;
    If $v \le 0_{b+|\widehat{\mathcal{A}}|}$  % the final interval
        $\lambda_{\tau+1} \leftarrow 0$;
        $\widehat{\alpha}(\lambda_{\tau+1}) \leftarrow (u_1, u_2, \ldots, u_b)^\top$;
    else  % an intermediate interval
        $k \leftarrow \operatorname{argmax}_i \{u_i / v_i \mid v_i > 0,\ i = 1, 2, \ldots, b + |\widehat{\mathcal{A}}|\}$;
        $\lambda_{\tau+1} \leftarrow \max\{0, u_k / v_k\}$;
        $\widehat{\alpha}(\lambda_{\tau+1}) \leftarrow (u_1, u_2, \ldots, u_b)^\top - \lambda_{\tau+1} (v_1, v_2, \ldots, v_b)^\top$;
        If $1 \le k \le b$
            $\widehat{\mathcal{A}} \leftarrow \widehat{\mathcal{A}} \cup \{k\}$;
        else
            $\widehat{\mathcal{A}} \leftarrow \widehat{\mathcal{A}} \setminus \{\hat{j}_{k-b}\}$;
        end
    end
    $\tau \leftarrow \tau + 1$;
end

$\widehat{\alpha}(\lambda) \leftarrow \begin{cases} 0_b & \text{if } \lambda \ge \lambda_0, \\ \dfrac{\lambda_{\tau+1} - \lambda}{\lambda_{\tau+1} - \lambda_\tau} \widehat{\alpha}(\lambda_\tau) + \dfrac{\lambda - \lambda_\tau}{\lambda_{\tau+1} - \lambda_\tau} \widehat{\alpha}(\lambda_{\tau+1}) & \text{if } \lambda_{\tau+1} \le \lambda \le \lambda_\tau. \end{cases}$

Figure 1: Pseudo code for computing the entire regularization path of LSIF. When the computation
of $\widehat{G}^{-1}$ is numerically unstable, we may add small positive diagonals to $\widehat{H}$ for stabilization
purposes.

When the Gaussian width $\sigma$ is very small or very large, the matrix $\widehat{H}$ tends to be nearly singular
and thus the matrix $\widehat{G}$ also becomes nearly singular. On the other hand, when the Gaussian width $\sigma$ is
not too small or too large compared with the dispersion of samples, the matrix $\widehat{G}$ is well-conditioned
and therefore the path-following algorithm would be stable and reliable.
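The dependence of the conditioning on the Gaussian width can be observed on a very small example. The sketch below computes the determinant of the empirical matrix of Eq. (4) for a two-center Gaussian kernel model at a tiny, a moderate, and a huge width. The samples, the centers, and the function name are hypothetical and chosen only for illustration.

```python
import math


def gram_determinant(x_tr, centers, sigma):
    """det(H_hat) for a two-center Gaussian kernel model on scalar training samples."""
    feats = [[math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
             for x in x_tr]
    n = len(x_tr)
    h11 = sum(f[0] * f[0] for f in feats) / n
    h22 = sum(f[1] * f[1] for f in feats) / n
    h12 = sum(f[0] * f[1] for f in feats) / n
    return h11 * h22 - h12 * h12


x_tr = [0.0, 0.5, 1.0, 1.5, 2.0]   # hypothetical training samples
centers = [0.25, 1.25]             # hypothetical kernel centers
for sigma in (0.01, 0.5, 100.0):
    print(sigma, gram_determinant(x_tr, centers, sigma))
```

For the tiny width all kernel values at the training points vanish, and for the huge width both basis functions are almost constant and hence almost collinear; in both cases the determinant is close to zero, whereas the moderate width gives a clearly non-singular matrix.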


3.  Approximation  Algorithm 

Within  the  quadratic  programming  formulation,  we  have  proposed  a  new  importance  estimation 
procedure  LSIF  and  showed  its  theoretical  properties.  We  also  gave  a  regularization  path  tracking 
algorithm  that  can  be  computed  efficiently.  However,  as  we  experimentally  show  in  Section  4,  it 
tends  to  suffer  from  a  numerical  problem  and  therefore  is  not  practically  reliable.  In  this  section,  we 
give a practical alternative to LSIF which gives an approximate solution to LSIF in a computationally
efficient and reliable manner.


3.1  Unconstrained  Least-squares  Formulation 


The approximation idea we introduce here is very simple: we ignore the non-negativity constraint
of the parameters in the optimization problem (6). This results in the following unconstrained
optimization problem.

$$\min_{\beta \in \mathbb{R}^b} \left[ \frac{1}{2} \beta^\top \widehat{H} \beta - \widehat{h}^\top \beta + \frac{\lambda}{2} \beta^\top \beta \right]. \tag{23}$$

In the above, we included a quadratic regularization term $\lambda \beta^\top \beta / 2$, instead of the linear one $\lambda 1_b^\top \beta$, since
the linear penalty term does not work as a regularizer without the non-negativity constraint. Eq. (23)
is an unconstrained convex quadratic program, so the solution can be analytically computed as

$$\widetilde{\beta}(\lambda) = (\widehat{H} + \lambda I_b)^{-1} \widehat{h},$$

where $I_b$ is the $b$-dimensional identity matrix. Since we dropped the non-negativity constraint $\beta \ge 0_b$,
some of the learned parameters could be negative. To compensate for this approximation error,
we modify the solution by

$$\widehat{\beta}(\lambda) = \max(0_b, \widetilde{\beta}(\lambda)),$$

where the 'max' operation for a pair of vectors is applied in the element-wise manner. This is the
solution of the approximation method we propose in this section.

An advantage of the above unconstrained formulation is that the solution can be computed just
by solving a system of linear equations. Therefore, its computation is stable when $\lambda$ is not too small.
We call this method unconstrained LSIF (uLSIF). Due to the $\ell_2$ regularizer, the solution tends to
be close to $0_b$ to some extent. Thus, the effect of ignoring the non-negativity constraint may not be
so strong; later, we analyze the approximation error both theoretically and experimentally in more
detail in Sections 3.3 and 4.5.

Note that LSIF and uLSIF differ only in parameter learning. Thus, the basis design heuristic of
LSIF given in Section 2.5 is also valid for uLSIF.
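The uLSIF solution amounts to one linear solve followed by clipping, as the following minimal Python sketch shows. The helper `solve_linear` (plain Gaussian elimination), the function name `ulsif`, and the 2 x 2 example values are assumptions made for this illustration; in practice a linear algebra library would be used for the solve.

```python
def solve_linear(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(y)
    M = [list(A[r]) + [y[r]] for r in range(n)]   # augmented copy of the system
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def ulsif(H, h, lam):
    """beta_tilde = (H + lam*I)^{-1} h, followed by element-wise clipping at zero."""
    b = len(h)
    B = [[H[l][m] + (lam if l == m else 0.0) for m in range(b)] for l in range(b)]
    beta_tilde = solve_linear(B, h)
    return [max(0.0, v) for v in beta_tilde]


# Hypothetical 2x2 problem, not taken from the paper's experiments.
H = [[1.0, 0.5], [0.5, 1.0]]
h = [1.0, 0.2]
beta_hat = ulsif(H, h, lam=0.5)
print(beta_hat)
```

In this example the unconstrained solution has one negative coefficient, which the clipping step sets to zero; no iterative optimization is involved.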
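3.2 Convergence Analysis of uLSIF

Here, we theoretically analyze the convergence property of the solution $\widehat{\beta}(\lambda)$ of the uLSIF
algorithm; practitioners may skip Sections 3.2 and 3.3.

Let $\beta^\circ(\lambda)$ be the optimal solution of the 'ideal' version of the problem (23):

$$\min_{\beta \in \mathbb{R}^b} \left[ \frac{1}{2} \beta^\top H \beta - h^\top \beta + \frac{\lambda}{2} \beta^\top \beta \right].$$

Then the ideal solution $\beta^*(\lambda)$ is given by

$$\beta^*(\lambda) = \max(0_b, \beta^\circ(\lambda)),$$

$$\beta^\circ(\lambda) = B_\lambda^{-1} h, \tag{24}$$

$$B_\lambda = H + \lambda I_b.$$

Below, we theoretically investigate the learning curve of uLSIF.

Let $\mathcal{B} \subset \{1, 2, \ldots, b\}$ be the set of negative indices of $\beta^\circ(\lambda)$, that is,

$$\mathcal{B} = \{\ell \mid \beta^\circ_\ell(\lambda) < 0,\ \ell = 1, 2, \ldots, b\},$$

and $\widehat{\mathcal{B}} \subset \{1, 2, \ldots, b\}$ be the set of negative indices of $\widetilde{\beta}(\lambda)$, that is,

$$\widehat{\mathcal{B}} = \{\ell \mid \widetilde{\beta}_\ell(\lambda) < 0,\ \ell = 1, 2, \ldots, b\}.$$

Then we have the following theorem.

Theorem 4 Assume that $\beta^\circ_\ell(\lambda) \ne 0$ for $\ell = 1, 2, \ldots, b$. Then, there exists a positive constant $c$ and
a natural number $N$ such that for $\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\} \ge N$,

$$P(\widehat{\mathcal{B}} \ne \mathcal{B}) < e^{-c \min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}}.$$

The proof of the above theorem is given in Appendix E. The assumption that $\beta^\circ_\ell(\lambda) \ne 0$ for
$\ell = 1, 2, \ldots, b$ corresponds to the strict complementarity condition (17) in LSIF. Theorem 4 shows
that the probability that $\widehat{\mathcal{B}}$ is different from $\mathcal{B}$ is exponentially small. Thus we may regard $\widehat{\mathcal{B}} = \mathcal{B}$ in
practice.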

Let $D$ be the $b$-dimensional diagonal matrix with the $\ell$-th diagonal element

$$D_{\ell,\ell} = \begin{cases} 0 & \ell \in \mathcal{B}, \\ 1 & \text{otherwise}. \end{cases}$$

Let

$$w^\circ(x) = \sum_{\ell=1}^{b} \beta^\circ_\ell(\lambda) \varphi_\ell(x),$$

$$u(x) = \sum_{\ell=1}^{b} [B_\lambda^{-1} D (H \beta^*(\lambda) - h)]_\ell \varphi_\ell(x).$$

Then we have the following theorem.

Theorem 5 Assume that

(a) $\beta^\circ_\ell(\lambda) \ne 0$ for $\ell = 1, 2, \ldots, b$.

(b) $n_{\mathrm{tr}}$ and $n_{\mathrm{te}}$ satisfy Eq. (18).

Then, for any $\lambda \ge 0$, we have

$$\mathbb{E}[J(\widehat{\beta}(\lambda))] = J(\beta^*(\lambda)) + \frac{1}{2 n_{\mathrm{tr}}} \mathrm{tr}\left(B_\lambda^{-1} D H D B_\lambda^{-1} C_{w^\circ,w^\circ} + 2 B_\lambda^{-1} C_{w^\circ,u}\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{25}$$

The proof of the above theorem is given in Appendix F. Theorem 5 elucidates the learning curve
of uLSIF up to the order of $n_{\mathrm{tr}}^{-1}$. An information criterion may be obtained in the same way as
Section 2.4.1. However, as shown in Section 3.4, we can have a closed-form expression of the
leave-one-out cross-validation score for uLSIF, which would be practically more useful. For this
reason, we do not go into the detail of information criterion.


3.3  Approximation  Error  Bounds  for  uLSIF 

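The uLSIF method is introduced as an approximation of LSIF. Here, we theoretically evaluate the
difference between the uLSIF solution $\widehat{\beta}(\lambda)$ and the LSIF solution $\widehat{\alpha}(\lambda)$. More specifically, we use
the following normalized $L_2$-norm on the training samples as the difference measure and derive its
upper bounds:

$$\mathrm{diff}(\lambda) = \frac{\inf_{\lambda' \ge 0} \sqrt{\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \widehat{w}(x_i^{\mathrm{tr}}; \widehat{\alpha}(\lambda')) - \widehat{w}(x_i^{\mathrm{tr}}; \widehat{\beta}(\lambda)) \right)^2}}{\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \widehat{w}(x_i^{\mathrm{tr}}; \widehat{\beta}(\lambda))}, \tag{26}$$

where the importance function $\widehat{w}(x; \alpha)$ is given by

$$\widehat{w}(x; \alpha) = \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x).$$

In the theoretical analysis below, we assume

$$\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \widehat{w}(x_i^{\mathrm{tr}}; \widehat{\beta}(\lambda)) \ne 0.$$

For $p \in \mathbb{N} \cup \{\infty\}$, let $\|\cdot\|_p$ be the $L_p$-norm, and let $\|\alpha\|_{\widehat{H}}$ be

$$\|\alpha\|_{\widehat{H}} = \sqrt{\alpha^\top \widehat{H} \alpha}, \tag{27}$$

where $\widehat{H}$ is the $b \times b$ matrix defined by Eq. (4). Then we have the following theorem.

Theorem 6 (Norm bound) Assume that all basis functions satisfy

$$0 < \varphi_\ell(x) \le 1.$$

Then we have

$$\mathrm{diff}(\lambda) \le \frac{\|\widehat{\beta}(\lambda)\|_{\widehat{H}}}{\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \widehat{w}(x_i^{\mathrm{tr}}; \widehat{\beta}(\lambda))} \tag{28}$$

$$\le b^2 \left( 1 + \frac{b}{\lambda} \right) \frac{n_{\mathrm{tr}}}{\min_\ell \sum_{i=1}^{n_{\mathrm{tr}}} \varphi_\ell(x_i^{\mathrm{tr}})} \cdot \frac{n_{\mathrm{te}}}{\min_\ell \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}})}, \tag{29}$$

where $b$ is the number of basis functions. The upper bound (29) is reduced as the regularization
parameter $\lambda$ increases. For the Gaussian basis function model (22), the upper bound (29) is reduced
as the Gaussian width $\sigma$ increases.

The proof of the above theorem is given in Appendix G. We call Eq. (28) the norm bound since
it is governed by the norm of $\widehat{\beta}$. Intuitively, the approximation error of uLSIF would be small if $\lambda$ is
large since $\beta \ge 0$ may not be severely violated due to the strong regularization effect. The upper
bound (29) justifies this intuitive claim since the error bound tends to be small if the regularization
parameter $\lambda$ is large. Furthermore, the upper bound (29) shows that for the Gaussian basis function
model (22), the error bound tends to be small if the Gaussian width $\sigma$ is large. This is also intuitive
since the Gaussian basis functions are nearly flat when the Gaussian width $\sigma$ is large, so that a difference
in parameters does not cause a significant change in the learned importance function $\widehat{w}(x)$. From
the above theorem, we expect that uLSIF is a good approximation of LSIF when $\lambda$ is large and $\sigma$ is
large. In Section 4.5, we numerically investigate this issue.

Below, we give a more sophisticated bound on $\mathrm{diff}(\lambda)$. To this end, let us introduce an intermediate
optimization problem defined by

$$\min_{\gamma \in \mathbb{R}^b} \left[ \frac{1}{2} \gamma^\top \widehat{H} \gamma - \widehat{h}^\top \gamma + \frac{\lambda}{2} \gamma^\top \gamma \right] \quad \text{subject to } \gamma \ge 0_b, \tag{30}$$

which we refer to as LSIF with quadratic penalty (LSIFq). LSIFq bridges LSIF and uLSIF since
the 'goodness-of-fit' part is the same as LSIF but the 'regularization' part is the same as uLSIF. Let
$\widehat{\gamma}(\lambda)$ be the optimal solution of LSIFq (30). Based on the solution of LSIFq, we have the following
upper bound.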


Theorem 7 (Bridge bound) For any $\lambda \ge 0$, the following inequality holds:

jx(\mh  •  iitoiu-  mm 

diff(X)  <  ^ - -.  (31) 

The proof of the above theorem is given in Appendix H. We call the above bound the bridge
bound since the bridged estimator $\widehat{\gamma}(\lambda)$ plays a central role in the bound. Note that, in the bridge
bound, the inside of the square root is assured to be non-negative due to Hölder's inequality (see
Appendix H for detail). The bridge bound is generally much sharper than the norm bound (28), but
not always (see Section 4.5 for numerical examples).




3.4  Efficient  Computation  of  Leave-one-out  Cross-validation  Score  for  uLSIF 

A practically important advantage of uLSIF over LSIF is that the score of leave-one-out
cross-validation (LOOCV) can be computed analytically; thanks to this property, the computational
complexity for performing LOOCV is the same order as just computing a single solution.

In the current setup, we are given two sets of samples, $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, which generally
have different sample sizes. For simplicity, we assume that $n_{\mathrm{tr}} < n_{\mathrm{te}}$ and the $i$-th training sample $x_i^{\mathrm{tr}}$
and the $i$-th test sample $x_i^{\mathrm{te}}$ are held out at the same time; the test samples $\{x_j^{\mathrm{te}}\}_{j=n_{\mathrm{tr}}+1}^{n_{\mathrm{te}}}$ are always
used for importance estimation. Note that this assumption is only for the sake of simplicity; we can
change the order of test samples without sacrificing the computational advantages.

Let $\widehat{w}^{(i)}(x)$ be an estimate of the importance obtained without the $i$-th training sample $x_i^{\mathrm{tr}}$ and the
$i$-th test sample $x_i^{\mathrm{te}}$. Then the LOOCV score is expressed as

$$\mathrm{LOOCV} = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left[ \frac{1}{2} \left( \widehat{w}^{(i)}(x_i^{\mathrm{tr}}) \right)^2 - \widehat{w}^{(i)}(x_i^{\mathrm{te}}) \right]. \tag{32}$$

Our approach to efficiently computing the LOOCV score is to use the Sherman-Woodbury-Morrison
formula (Golub and Loan, 1996) for computing matrix inverses: for an invertible square matrix $A$
and vectors $u$ and $v$ such that $v^\top A^{-1} u \ne -1$,

$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}. \tag{33}$$

Efficient  approximation  schemes  of  LOOCV  have  often  been  investigated  under  asymptotic 
setups  (Stone,  1974;  Hansen  and  Larsen,  1996).  On  the  other  hand,  we  provide  the  exact  LOOCV 
score  of  uLSIF,  which  follows  the  same  line  as  that  of  ridge  regression  (Hoerl  and  Kennard,  1970; 
Wahba,  1990). 

A  pseudo  code  of  uLSIF  with  LOOCV-based  model  selection  is  summarized  in  Figure  2 — its 
detailed  derivation  is  described  in  Appendix  I.  Note  that  the  basis-function  design  heuristic  given 
in  Section  2.5  is  used  in  the  pseudo  code,  but  the  analytic  form  of  the  LOOCV  score  is  available  for 
any  basis  functions. 
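The idea behind the analytic LOOCV score can be checked numerically on a small problem. The sketch below computes the leave-one-out coefficients in two ways: by refitting from scratch without the i-th training and test samples, and through the rank-one update of Eq. (33), which only requires linear solves with one fixed matrix. It then compares the resulting scores of Eq. (32). All function names, the scalar toy data, and the parameter values are assumptions made for this illustration; the code is not the vectorized procedure of Figure 2.

```python
import math


def solve_linear(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [list(A[r]) + [y[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def phi(x, centers, sigma):
    return [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]


def mean_outer(F):
    b = len(F[0])
    return [[sum(f[l] * f[m] for f in F) / len(F) for m in range(b)] for l in range(b)]


def mean_vec(F):
    return [sum(f[l] for f in F) / len(F) for l in range(len(F[0]))]


def add_ridge(H, ridge):
    b = len(H)
    return [[H[l][m] + (ridge if l == m else 0.0) for m in range(b)] for l in range(b)]


def loo_beta_brute(F_tr, F_te, lam, i):
    """Refit from scratch without the i-th training and the i-th test sample."""
    H_i = mean_outer(F_tr[:i] + F_tr[i + 1:])
    h_i = mean_vec(F_te[:i] + F_te[i + 1:])
    return solve_linear(add_ridge(H_i, lam), h_i)


def loo_beta_rank_one(F_tr, F_te, lam, i):
    """Same coefficients via the rank-one update of Eq. (33)."""
    n_tr, n_te = len(F_tr), len(F_te)
    B_hat = add_ridge(mean_outer(F_tr), lam * (n_tr - 1) / n_tr)
    h = mean_vec(F_te)
    h_i = [(n_te * h[l] - F_te[i][l]) / (n_te - 1) for l in range(len(h))]
    a = solve_linear(B_hat, h_i)       # B_hat^{-1} h_i
    z = solve_linear(B_hat, F_tr[i])   # B_hat^{-1} phi(x_i^tr)
    dot_fa = sum(p * q for p, q in zip(F_tr[i], a))
    dot_fz = sum(p * q for p, q in zip(F_tr[i], z))
    scale = (n_tr - 1) / n_tr
    return [scale * (a[l] + z[l] * dot_fa / (n_tr - dot_fz)) for l in range(len(h))]


def loocv_score(F_tr, F_te, lam, loo_beta):
    """Eq. (32), with negative coefficients clipped to zero as in uLSIF."""
    total = 0.0
    for i in range(len(F_tr)):
        beta = [max(0.0, v) for v in loo_beta(F_tr, F_te, lam, i)]
        w_tr = sum(bl * f for bl, f in zip(beta, F_tr[i]))
        w_te = sum(bl * f for bl, f in zip(beta, F_te[i]))
        total += 0.5 * w_tr ** 2 - w_te
    return total / len(F_tr)


centers, sigma, lam = [1.5, 2.0, 2.5], 0.5, 0.1   # hypothetical settings
x_tr = [0.3, 0.8, 1.0, 1.2, 1.7]                  # hypothetical samples with n_tr < n_te
x_te = [1.6, 1.9, 2.0, 2.1, 2.2, 2.4]
F_tr = [phi(x, centers, sigma) for x in x_tr]
F_te = [phi(x, centers, sigma) for x in x_te]
brute = loocv_score(F_tr, F_te, lam, loo_beta_brute)
fast = loocv_score(F_tr, F_te, lam, loo_beta_rank_one)
print(brute, fast)
```

Both routes agree up to rounding error, while the second one reuses the same matrix for every held-out index; this reuse is what allows the score to be computed at the cost of a single fit.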


4.  Illustrative  Examples 

In this section, we illustrate the behavior of LSIF and uLSIF using a toy data set.

4.1 Setup

Let the dimension of the domain be $d = 1$ and the training and test densities be

$$p_{\mathrm{tr}}(x) = \mathcal{N}(x; 1, (1/2)^2),$$

$$p_{\mathrm{te}}(x) = \mathcal{N}(x; 2, (1/4)^2),$$

where $\mathcal{N}(x; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. These densities are
depicted in Figure 3. The task is to estimate the importance $w(x) = p_{\mathrm{te}}(x) / p_{\mathrm{tr}}(x)$.
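For reference, the true importance of this toy setup can be evaluated directly from the two Gaussian densities. The following short Python sketch does so; the function names are chosen for this illustration only.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the univariate normal N(mean, std**2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def true_importance(x):
    """w(x) = p_te(x) / p_tr(x) for the toy setup of Section 4.1."""
    p_tr = gaussian_pdf(x, mean=1.0, std=0.5)    # training density N(1, (1/2)^2)
    p_te = gaussian_pdf(x, mean=2.0, std=0.25)   # test density N(2, (1/4)^2)
    return p_te / p_tr


for x in (1.0, 1.5, 2.0, 2.5):
    print(x, true_importance(x))
```

The importance is tiny around the training mean and large around the test mean, which is the kind of target function that the estimators are asked to recover from samples alone.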
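Input: $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$
Output: $\hat{w}(x)$

$b \leftarrow \min(100, n_{te})$; $n \leftarrow \min(n_{tr}, n_{te})$;
Randomly choose $b$ centers $\{c_\ell\}_{\ell=1}^{b}$ from $\{x_j^{te}\}_{j=1}^{n_{te}}$ without replacement;
For each candidate of Gaussian width $\sigma$
    $\widehat{H}_{\ell,\ell'} \leftarrow \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|x_i^{tr} - c_\ell\|^2 + \|x_i^{tr} - c_{\ell'}\|^2}{2\sigma^2}\right)$ for $\ell, \ell' = 1, 2, \ldots, b$;
    $\hat{h}_\ell \leftarrow \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|x_j^{te} - c_\ell\|^2}{2\sigma^2}\right)$ for $\ell = 1, 2, \ldots, b$;
    $X^{tr}_{\ell,i} \leftarrow \exp\left(-\frac{\|x_i^{tr} - c_\ell\|^2}{2\sigma^2}\right)$ for $i = 1, 2, \ldots, n$ and $\ell = 1, 2, \ldots, b$;
    $X^{te}_{\ell,i} \leftarrow \exp\left(-\frac{\|x_i^{te} - c_\ell\|^2}{2\sigma^2}\right)$ for $i = 1, 2, \ldots, n$ and $\ell = 1, 2, \ldots, b$;
    For each candidate of regularization parameter $\lambda$
        $\widehat{B} \leftarrow \widehat{H} + \frac{\lambda (n_{tr} - 1)}{n_{tr}} I_b$;
        $B_0 \leftarrow \widehat{B}^{-1} \hat{h} 1_n^\top + \widehat{B}^{-1} X^{tr} \mathrm{diag}\left(\frac{\hat{h}^\top \widehat{B}^{-1} X^{tr}}{n_{tr} 1_n^\top - 1_b^\top (X^{tr} * \widehat{B}^{-1} X^{tr})}\right)$;
        $B_1 \leftarrow \widehat{B}^{-1} X^{te} + \widehat{B}^{-1} X^{tr} \mathrm{diag}\left(\frac{1_b^\top (X^{te} * \widehat{B}^{-1} X^{tr})}{n_{tr} 1_n^\top - 1_b^\top (X^{tr} * \widehat{B}^{-1} X^{tr})}\right)$;
        $B_2 \leftarrow \max\left(O_{b \times n}, \frac{n_{tr} - 1}{n_{tr} (n_{te} - 1)} (n_{te} B_0 - B_1)\right)$;
        $w_{tr} \leftarrow (1_b^\top (X^{tr} * B_2))^\top$; $w_{te} \leftarrow (1_b^\top (X^{te} * B_2))^\top$;
        $\mathrm{LOOCV}(\sigma, \lambda) \leftarrow \frac{w_{tr}^\top w_{tr}}{2n} - \frac{1_n^\top w_{te}}{n}$;
    end
end
$(\hat{\sigma}, \hat{\lambda}) \leftarrow \mathrm{argmin}_{(\sigma, \lambda)} \mathrm{LOOCV}(\sigma, \lambda)$;
$\widetilde{H}_{\ell,\ell'} \leftarrow \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|x_i^{tr} - c_\ell\|^2 + \|x_i^{tr} - c_{\ell'}\|^2}{2\hat{\sigma}^2}\right)$ for $\ell, \ell' = 1, 2, \ldots, b$;
$\tilde{h}_\ell \leftarrow \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|x_j^{te} - c_\ell\|^2}{2\hat{\sigma}^2}\right)$ for $\ell = 1, 2, \ldots, b$;
$\tilde{\alpha} \leftarrow \max(0_b, (\widetilde{H} + \hat{\lambda} I_b)^{-1} \tilde{h})$;
$\hat{w}(x) \leftarrow \sum_{\ell=1}^{b} \tilde{\alpha}_\ell \exp\left(-\frac{\|x - c_\ell\|^2}{2\hat{\sigma}^2}\right)$;

Figure 2: Pseudo code of uLSIF algorithm with LOOCV. $B * B'$ denotes the element-wise multiplication
of matrices $B$ and $B'$ of the same size, that is, the $(i,j)$-th element is given by
$B_{i,j} B'_{i,j}$. For $n$-dimensional vectors $b$ and $b'$, $\mathrm{diag}(b/b')$ denotes the $n \times n$ diagonal matrix
with $i$-th diagonal element $b_i/b'_i$. A MATLAB® or R implementation of uLSIF is available
from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/.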
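The final stage of the pseudo code (computing the coefficients for a fixed Gaussian width and regularization parameter) can be sketched in a few lines of Python. This is a minimal one-dimensional illustration rather than the authors' MATLAB or R implementation: the LOOCV loop is omitted, the values `sigma=0.3` and `lam=0.1` are fixed by hand, and the helper names (`gauss_kernel`, `solve_linear`, `ulsif_fit`, `ulsif_predict`) are assumptions introduced here.

```python
import math
import random


def gauss_kernel(x, c, sigma):
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def ulsif_fit(x_tr, x_te, centers, sigma, lam):
    """Return non-negative coefficients alpha for fixed sigma and lambda."""
    b = len(centers)
    phi_tr = [[gauss_kernel(x, c, sigma) for c in centers] for x in x_tr]
    phi_te = [[gauss_kernel(x, c, sigma) for c in centers] for x in x_te]
    # H[l][m] = (1/n_tr) sum_i phi_l(x_i^tr) phi_m(x_i^tr); h[l] = (1/n_te) sum_j phi_l(x_j^te)
    H = [[sum(p[l] * p[m] for p in phi_tr) / len(x_tr) for m in range(b)]
         for l in range(b)]
    h = [sum(p[l] for p in phi_te) / len(x_te) for l in range(b)]
    for l in range(b):
        H[l][l] += lam                               # H + lambda * I_b
    alpha = solve_linear(H, h)
    return [max(0.0, a) for a in alpha]              # round negative coefficients up to zero


def ulsif_predict(x, centers, alpha, sigma):
    return sum(a * gauss_kernel(x, c, sigma) for a, c in zip(alpha, centers))


random.seed(0)
x_tr = [random.gauss(1.0, 0.5) for _ in range(200)]    # toy training inputs
x_te = [random.gauss(2.0, 0.25) for _ in range(200)]   # toy test inputs
centers = random.sample(x_te, 20)                      # centers chosen from test points
alpha = ulsif_fit(x_tr, x_te, centers, sigma=0.3, lam=0.1)
```

Only one linear system is solved per parameter pair, so wrapping `ulsif_fit` in a grid search over candidate widths and regularization parameters is straightforward.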
4.2 Importance Estimation

First, we illustrate the behavior of LSIF and uLSIF in importance estimation. We set the number
of training and test samples at $n_{tr} = 200$ and $n_{te} = 1000$, respectively. We use the Gaussian kernel
model (22), and the number of basis functions is set at $b = 100$. The centers of the kernel functions
are randomly chosen from the test points $\{x_j^{te}\}_{j=1}^{n_{te}}$ without replacement (see Section 2.5).

We test different Gaussian widths $\sigma$ and different regularization parameters $\lambda$. The following
two setups are examined:

(A) $\lambda$ is fixed at $\lambda = 0.2$ and $\sigma$ is changed as $0.1 \le \sigma \le 1.0$,

(B) $\sigma$ is fixed at $\sigma = 0.3$ and $\lambda$ is changed as $0 \le \lambda \le 0.5$.

Figure 4 depicts the true importance and its estimates obtained by LSIF and uLSIF, where all
importance functions are normalized so that $\int w(x) dx = 1$ for better comparison. Figures 4(a) and
4(b) show that the estimated importance $\hat{w}(x)$ tends to be too peaky when the Gaussian width $\sigma$
is small, while it tends to be overly smoothed when $\sigma$ is large. If the Gaussian width is chosen
appropriately, both LSIF and uLSIF seem to work reasonably well. As shown in Figures 4(c)
and 4(d), the solutions of LSIF and uLSIF also significantly change when different regularization
parameters $\lambda$ are used. Again, given that the regularization parameter is chosen appropriately, both
LSIF and uLSIF tend to perform well.

From the graphs, we also observe that model selection based on cross-validation works reasonably
well for both LSIF (5-fold) and uLSIF (leave-one-out) to choose appropriate values of the
Gaussian width or the regularization parameter; this will be analyzed in more detail in Section 4.4.

4.3  Regularization  Path 

Next, we illustrate how the regularization path tracking algorithm for LSIF behaves. We set the
number of training and test samples at $n_{tr} = 50$ and $n_{te} = 100$, respectively. For better illustration,
we set the number of basis functions at a small value, $b = 30$, in the Gaussian kernel model (22)
and use the Gaussian kernels centered at equidistant points in $[0, 3]$ as basis functions.

We use the algorithm described in Figure 1 for regularization path tracking. Theoretically, the
inequality $\lambda_{\tau+1} < \lambda_\tau$ is assured. In numerical computation, however, the inequality is occasionally
violated. In order to avoid this numerical problem, we slightly regularize $\widehat{H}$ for stabilization (see
also the caption of Figure 1).

Figure 5 depicts the values of the estimated coefficients $\{\alpha_\ell\}_{\ell=1}^{b}$ as functions of $\|\alpha\|_1$ for
$\sigma = 0.1$, 0.3, and 0.5. Note that small $\|\alpha\|_1$ corresponds to large $\lambda$. The figure indicates that the
regularization parameter $\lambda$ works as a sparseness controlling factor of the solution, that is, the larger
(smaller) the value of $\lambda$ ($\|\alpha\|_1$) is, the sparser the solution is.

The path following algorithm is computationally efficient and therefore practically very attractive.
However, as the above experiments illustrate, the path following algorithm is numerically
rather unstable. Modifying $\widehat{H}$ can ease this problem, but this in turn results in numerical errors
accumulating through the path tracking process. Consequently, the solutions for small
$\lambda$ tend to be inaccurate. This problem becomes prominent if the number of change points in the
regularization path is large.
Figure 3: The solid line is the probability density of training data, and the dashed line is the probability
density of test data.


(a) LSIF for $\lambda = 0.2$, $\sigma = 0.1, 0.4, 1.0$.


(b) uLSIF for $\lambda = 0.2$, $\sigma = 0.1, 0.3, 1.0$.


(c) LSIF for $\sigma = 0.3$, $\lambda = 0, 0.2, 0.5$.


(d) uLSIF for $\sigma = 0.3$, $\lambda = 0, 0.09, 0.5$.


Figure 4: True and estimated importance functions obtained by LSIF and uLSIF for various different
Gaussian widths $\sigma$ and regularization parameters $\lambda$.
(c) $\sigma = 0.5$.


Figure 5: Regularization path of LSIF: the values of the estimated coefficients $\{\alpha_\ell\}_{\ell=1}^{b}$ are depicted
as functions of the $L_1$-norm of the estimated parameter vector for $\sigma = 0.1$, 0.3, and 0.5.
Small $\|\alpha\|_1$ corresponds to large $\lambda$.


4.4  Cross-validation 

Here we illustrate the behavior of the cross-validation scores of LSIF and uLSIF. We set the number
of training and test samples at $n_{tr} = 200$ and $n_{te} = 1000$, respectively. The number of template
points is $b = 100$ and the Gaussian kernel model (22) is used. The centers of the kernel functions
are randomly chosen from the test points as described in Section 4.2. The left column of Figure 6
depicts the expectation of the true cost $J(\alpha)$ over 50 runs for LSIF and its estimate by 5-fold CV (25,
50, and 75 percentiles are plotted in the figure) as functions of the Gaussian width $\sigma$ for $\lambda = 0.2$, 0.5,
and 0.8. We used the regularization path tracking algorithm for computing the solutions of LSIF.
Figure 6: The true cost $J$ and its cross-validation estimate as functions of Gaussian width $\sigma$ for
different values of $\lambda$. The solid line denotes the expectation of the true cost $J$ over 50
runs, while 'o' and error bars denote the 25, 50, and 75 percentiles of the cross-validation
score. Panels (horizontal axis: $\sigma$ (Gaussian width), with ticks at 0.1, 0.2, 0.5, and 1.0):
(a) LSIF with 5CV ($\lambda = 0.2$). (b) uLSIF with LOOCV ($\lambda = 0.2$).
(c) LSIF with 5CV ($\lambda = 0.5$). (d) uLSIF with LOOCV ($\lambda = 0.5$).
(e) LSIF with 5CV ($\lambda = 0.8$). (f) uLSIF with LOOCV ($\lambda = 0.8$).
The right column shows the expected true cost and its LOOCV estimates for uLSIF in the same
manner.

The graphs show that overall CV gives reasonably good approximations of the expected cost,
although CV for LSIF with small $\lambda$ and small $\sigma$ is rather inaccurate due to numerical problems: the
solution path of LSIF is computed from $\lambda = \infty$ to $\lambda = 0$, and the numerical error accumulates as
the tracking process approaches $\lambda = 0$. This phenomenon seems problematic when $\sigma$ is small.

4.5  Difference  between  LSIF  and  uLSIF 

In  Section  3.3,  we  analyzed  the  approximation  error  of  uLSIF  against  LSIF.  Here  we  numerically 
investigate  the  behavior  of  the  approximation  error  (26)  as  well  as  the  norm  bound  (28)  and  the 
bridge  bound  (31).  We  set  the  number  of  training  and  test  samples  at  ntr  =  200  and  nte  =  1000, 
respectively.  The  number  of  template  points  in  the  Gaussian  kernel  model  (22)  is  set  at  b  =  100. 
The  centers  of  the  kernel  functions  are  randomly  chosen  from  the  test  points  (see  Section  4.2). 

Figure 7 depicts the true approximation error as well as its upper bounds as functions of the regularization
parameter $\lambda$; $\lambda$ is varied from 0.001 to 10 and the three Gaussian widths $\sigma = 0.1, 0.5, 1.0$
are tested. The graphs show that when $\lambda$ and $\sigma$ are large, the approximation error tends to be small;
this is in good agreement with the theoretical analysis given in Section 3.3. The bridge bound is
fairly tight in the entire range and is sharper than the norm bound except when $\sigma$ is small and $\lambda$ is
large.

4.6  Summary 

Through  the  numerical  examples,  we  overall  found  that  LSIF  and  uLSIF  give  qualitatively  similar 
results.  However,  the  computation  of  the  solution-path  tracking  algorithm  for  LSIF  tends  to  be 
numerically  unstable,  which  can  result  in  unreliable  model  selection  performance.  On  the  other 
hand,  only  a  system  of  linear  equations  needs  to  be  solved  in  uLSIF,  which  turned  out  to  be  much 
more  stable  than  LSIF.  Thus,  uLSIF  would  be  practically  more  reliable  than  LSIF. 

Based  on  the  above  findings,  we  will  focus  on  uLSIF  in  the  rest  of  this  paper. 

5.  Relation  to  Existing  Methods 

In  this  section,  we  discuss  the  characteristics  of  existing  approaches  in  comparison  with  the  pro¬ 
posed  methods. 

5.1  Kernel  Density  Estimator 

Figure 7: The approximation error of uLSIF against LSIF as functions of the regularization parameter
$\lambda$ for different Gaussian widths $\sigma$: (a) $\sigma = 0.1$, (b) $\sigma = 0.5$, (c) $\sigma = 1.0$. Its upper bounds are
also plotted in the graphs.

The kernel density estimator (KDE) is a non-parametric technique to estimate a probability density
function $p(x)$ from its i.i.d. samples $\{x_k\}_{k=1}^{n}$. For the Gaussian kernel (21), KDE is expressed as

$$\hat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{k=1}^{n} K_\sigma(x, x_k).$$

The performance of KDE depends on the choice of the kernel width $\sigma$. The kernel width $\sigma$ can
be optimized by likelihood cross-validation (LCV) as follows (Härdle et al., 2004): First, divide
the samples $\{x_k\}_{k=1}^{n}$ into $R$ disjoint subsets $\{X_r\}_{r=1}^{R}$. Then obtain a density estimate $\hat{p}_{X_r}(x)$ from
$\{X_j\}_{j \neq r}$ (i.e., without $X_r$) and compute its log-likelihood for $X_r$:

$$\frac{1}{|X_r|} \sum_{x \in X_r} \log \hat{p}_{X_r}(x).$$

Repeat this procedure for $r = 1, 2, \ldots, R$ and choose the value of $\sigma$ such that the average of the above
hold-out log-likelihood over all $r$ is maximized. Note that the average hold-out log-likelihood is an
almost unbiased estimate of the Kullback-Leibler divergence from $p(x)$ to $\hat{p}(x)$, up to an irrelevant
constant.
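The LCV procedure above can be sketched as follows for one-dimensional data. This is an illustrative sketch only: the fold assignment (every `n_folds`-th sample), the candidate widths, and the function names `kde`, `lcv_score`, and `select_width` are assumptions made here, not details taken from the original experiments.

```python
import math
import random


def kde(x, samples, sigma):
    """One-dimensional Gaussian kernel density estimate evaluated at x."""
    norm = len(samples) * math.sqrt(2.0 * math.pi * sigma ** 2)
    return sum(math.exp(-((x - s) ** 2) / (2.0 * sigma ** 2)) for s in samples) / norm


def lcv_score(samples, sigma, n_folds=5):
    """Average hold-out log-likelihood over n_folds disjoint subsets."""
    folds = [samples[r::n_folds] for r in range(n_folds)]
    total = 0.0
    for r, held_out in enumerate(folds):
        rest = [s for j, fold in enumerate(folds) if j != r for s in fold]
        # The floor guards against log(0) when a held-out point is far from all others.
        total += sum(math.log(max(kde(x, rest, sigma), 1e-300))
                     for x in held_out) / len(held_out)
    return total / n_folds


def select_width(samples, candidates, n_folds=5):
    """Pick the candidate width with the largest hold-out log-likelihood."""
    return max(candidates, key=lambda s: lcv_score(samples, s, n_folds))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_sigma = select_width(data, candidates)
```

For importance estimation, the same selection would be run separately on the training and test samples before taking the ratio of the two density estimates.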

KDE can be used for importance estimation by first obtaining density estimators $\hat{p}_{tr}(x)$ and
$\hat{p}_{te}(x)$ separately from $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$, and then estimating the importance by
$\hat{w}(x) = \hat{p}_{te}(x)/\hat{p}_{tr}(x)$. A potential limitation of this approach is that KDE suffers from the curse of
dimensionality (Vapnik, 1998; Härdle et al., 2004), that is, the number of samples needed to maintain the
same approximation quality grows exponentially as the dimension of the domain increases. This is
critical when the number of available samples is limited. Therefore, the KDE-based approach may
not be reliable in high-dimensional problems.


5.2  Kernel  Mean  Matching 

The kernel mean matching (KMM) method allows us to directly obtain an estimate of the importance
values at training points without going through density estimation (Huang et al., 2007). The
basic idea of KMM is to find $w(x)$ such that the mean discrepancy between nonlinearly transformed
samples drawn from $p_{te}(x)$ and $p_{tr}(x)$ is minimized in a universal reproducing kernel Hilbert space
(Steinwart, 2001). The Gaussian kernel (21) is an example of kernels that induce universal reproducing
kernel Hilbert spaces, and it has been shown that the solution of the following optimization
problem agrees with the true importance:

$$\min_{w(x)} \left\| \int K_\sigma(x, \cdot) p_{te}(x) dx - \int K_\sigma(x, \cdot) w(x) p_{tr}(x) dx \right\|_{\mathcal{H}}^2$$
$$\text{subject to } \int w(x) p_{tr}(x) dx = 1 \text{ and } w(x) \ge 0,$$

where $\|\cdot\|_{\mathcal{H}}$ denotes the norm in the Gaussian reproducing kernel Hilbert space and $K_\sigma(x, x')$ is the
Gaussian kernel (21).

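An empirical version of the above problem is reduced to the following quadratic program:

$$\min_{\{w_i\}_{i=1}^{n_{tr}}} \left[ \frac{1}{2} \sum_{i,i'=1}^{n_{tr}} w_i w_{i'} K_\sigma(x_i^{tr}, x_{i'}^{tr}) - \sum_{i=1}^{n_{tr}} w_i \kappa_i \right]$$
$$\text{subject to } \left| \sum_{i=1}^{n_{tr}} w_i - n_{tr} \right| \le n_{tr} \epsilon \text{ and } 0 \le w_1, w_2, \ldots, w_{n_{tr}} \le B,$$

where

$$\kappa_i = \frac{n_{tr}}{n_{te}} \sum_{j=1}^{n_{te}} K_\sigma(x_i^{tr}, x_j^{te}).$$

$B$ ($\ge 0$) and $\epsilon$ ($\ge 0$) are tuning parameters that control the regularization effects. The solution
$\{\hat{w}_i\}_{i=1}^{n_{tr}}$ is an estimate of the importance at the training points $\{x_i^{tr}\}_{i=1}^{n_{tr}}$.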
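To make the ingredients of the quadratic program concrete, the following sketch computes $\kappa_i$, evaluates the objective for a given weight vector, and checks the constraints for one-dimensional inputs. It deliberately stops short of solving the program, which would require a quadratic programming solver; the function names are hypothetical.

```python
import math


def gauss_kernel(x, y, sigma):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def kmm_kappa(x_tr, x_te, sigma):
    """kappa_i = (n_tr / n_te) * sum_j K_sigma(x_i^tr, x_j^te)."""
    scale = len(x_tr) / len(x_te)
    return [scale * sum(gauss_kernel(xi, xj, sigma) for xj in x_te) for xi in x_tr]


def kmm_objective(w, x_tr, kappa, sigma):
    """0.5 * sum_{i,i'} w_i w_i' K(x_i, x_i') - sum_i w_i kappa_i."""
    quad = sum(wi * wk * gauss_kernel(xi, xk, sigma)
               for wi, xi in zip(w, x_tr) for wk, xk in zip(w, x_tr))
    return 0.5 * quad - sum(wi * ki for wi, ki in zip(w, kappa))


def kmm_feasible(w, B, eps):
    """Check |sum_i w_i - n_tr| <= n_tr * eps and 0 <= w_i <= B."""
    n_tr = len(w)
    return abs(sum(w) - n_tr) <= n_tr * eps and all(0.0 <= wi <= B for wi in w)


x_tr = [0.0, 1.0]
x_te = [0.0, 0.0, 0.0, 0.0]
kappa = kmm_kappa(x_tr, x_te, sigma=1.0)
```

Note that the optimization variables are the weights at the training points themselves, so nothing in this formulation yields an importance value at a new input.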

Since KMM does not involve density estimation, it is expected to work well even in high-dimensional
cases. However, the performance is dependent on the tuning parameters $B$, $\epsilon$, and $\sigma$, and they
cannot be simply optimized, for example, by CV since estimates of the importance are available
only at the training points. A popular heuristic is to use the median distance between samples as
the Gaussian width $\sigma$, which is shown to be useful (Schölkopf and Smola, 2002; Song et al., 2007).
However, there seems to be no strong justification for this heuristic. For the choice of $\epsilon$, a theoretical
result given in Huang et al. (2007) could be used as guidance, although it is still hard to determine
the best value of $\epsilon$ in practice.


5.3  Logistic  Regression 

Another approach to directly estimating the importance is to use a probabilistic classifier. Let us
assign a selector variable $\eta = -1$ to training samples and $\eta = 1$ to test samples, that is, the training
and test densities are written as

$$p_{tr}(x) = p(x \mid \eta = -1),$$
$$p_{te}(x) = p(x \mid \eta = 1).$$

Note that $\eta$ is regarded as a random variable.

Application of the Bayes theorem yields that the importance can be expressed in terms of $\eta$ as
follows (Qin, 1998; Cheng and Chu, 2004; Bickel et al., 2007):

$$w(x) = \frac{p(\eta = -1)}{p(\eta = 1)} \frac{p(\eta = 1 \mid x)}{p(\eta = -1 \mid x)}.$$

The probability ratio of test and training samples may be simply estimated by the ratio of the numbers
of samples:

$$\frac{p(\eta = -1)}{p(\eta = 1)} \approx \frac{n_{tr}}{n_{te}}.$$

The conditional probability $p(\eta \mid x)$ could be approximated by discriminating test samples from training
samples using a logistic regression (LogReg) classifier, where $\eta$ plays the role of a class variable.
Below we briefly explain the LogReg method.

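The LogReg classifier employs a parametric model of the following form for expressing the
conditional probability $p(\eta \mid x)$:

$$\hat{p}(\eta \mid x) = \frac{1}{1 + \exp\left(-\eta \sum_{\ell=1}^{m} \zeta_\ell \phi_\ell(x)\right)},$$

where $m$ is the number of basis functions and $\{\phi_\ell(x)\}_{\ell=1}^{m}$ are fixed basis functions. The parameter $\zeta$
is learned so that the negative regularized log-likelihood is minimized:

$$\hat{\zeta} = \mathrm{argmin}_{\zeta} \left[ \sum_{i=1}^{n_{tr}} \log\left(1 + \exp\left(\sum_{\ell=1}^{m} \zeta_\ell \phi_\ell(x_i^{tr})\right)\right) + \sum_{j=1}^{n_{te}} \log\left(1 + \exp\left(-\sum_{\ell=1}^{m} \zeta_\ell \phi_\ell(x_j^{te})\right)\right) + \lambda \zeta^\top \zeta \right].$$

Since the above objective function is convex, the global optimal solution can be obtained by standard
nonlinear optimization methods such as Newton's method, the conjugate gradient method, and the
BFGS method (Minka, 2007). Then the importance estimate is given by

$$\hat{w}(x) = \frac{n_{tr}}{n_{te}} \exp\left(\sum_{\ell=1}^{m} \hat{\zeta}_\ell \phi_\ell(x)\right). \tag{34}$$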
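The route from the classifier to the importance estimate (34) can be illustrated with a small Python sketch. It is a simplified stand-in, not the LIBLINEAR-based setup used in the experiments: the basis functions are assumed to be $\phi_1(x) = 1$ and $\phi_2(x) = x$, plain gradient descent replaces Newton-type solvers, and the step size, iteration count, and regularization value are arbitrary choices.

```python
import math
import random


def features(x):
    return [1.0, x]                  # illustrative basis: phi_1(x) = 1, phi_2(x) = x


def score(zeta, x):
    return sum(z * p for z, p in zip(zeta, features(x)))


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logreg(x_tr, x_te, lam=0.01, lr=0.5, n_iter=500):
    """Minimize the regularized negative log-likelihood by gradient descent."""
    zeta = [0.0, 0.0]
    n = len(x_tr) + len(x_te)
    for _ in range(n_iter):
        grad = [2.0 * lam * z for z in zeta]          # gradient of lam * zeta^T zeta
        for x in x_tr:                                # eta = -1 term: log(1 + exp(f))
            s = sigmoid(score(zeta, x))
            grad = [g + s * p for g, p in zip(grad, features(x))]
        for x in x_te:                                # eta = +1 term: log(1 + exp(-f))
            s = sigmoid(-score(zeta, x))
            grad = [g - s * p for g, p in zip(grad, features(x))]
        zeta = [z - lr * g / n for z, g in zip(zeta, grad)]
    return zeta


def importance(x, zeta, n_tr, n_te):
    """Importance estimate (34): (n_tr / n_te) * exp(sum_l zeta_l phi_l(x))."""
    return (n_tr / n_te) * math.exp(score(zeta, x))


random.seed(0)
x_tr = [random.gauss(0.0, 1.0) for _ in range(200)]   # training inputs (eta = -1)
x_te = [random.gauss(1.0, 1.0) for _ in range(200)]   # test inputs (eta = +1)
zeta = fit_logreg(x_tr, x_te)
```

The exponential form in (34) follows because, under the logistic model, the ratio $\hat{p}(\eta = 1 \mid x)/\hat{p}(\eta = -1 \mid x)$ equals $\exp(\sum_\ell \zeta_\ell \phi_\ell(x))$.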


An advantage of the LogReg method is that model selection (that is, the choice of the basis
functions $\{\phi_\ell(x)\}_{\ell=1}^{m}$ as well as the regularization parameter $\lambda$) is possible by standard CV since the
learning problem involved above is a standard supervised classification problem.
5.4 Kullback-Leibler Importance Estimation Procedure

The Kullback-Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008a) also directly
gives an estimate of the importance function without going through density estimation by
matching the two distributions in terms of the Kullback-Leibler divergence (Kullback and Leibler,
1951).

Let us model the importance $w(x)$ by the linear model (1). An estimate of the test density $p_{te}(x)$
is given by using the model $\hat{w}(x)$ as

$$\hat{p}_{te}(x) = \hat{w}(x) p_{tr}(x).$$

In KLIEP, the parameters $\alpha$ are determined so that the Kullback-Leibler divergence from $p_{te}(x)$ to
$\hat{p}_{te}(x)$ is minimized:

$$\mathrm{KL}[p_{te}(x) \| \hat{p}_{te}(x)] = \int_{\mathcal{D}} p_{te}(x) \log \frac{p_{te}(x)}{\hat{w}(x) p_{tr}(x)} dx = \int_{\mathcal{D}} p_{te}(x) \log \frac{p_{te}(x)}{p_{tr}(x)} dx - \int_{\mathcal{D}} p_{te}(x) \log \hat{w}(x) dx. \tag{35}$$


The first term is a constant, so it can be safely ignored. Since $\hat{p}_{te}(x)$ ($= \hat{w}(x) p_{tr}(x)$) is a probability
density function, it should satisfy

$$1 = \int_{\mathcal{D}} \hat{p}_{te}(x) dx = \int_{\mathcal{D}} \hat{w}(x) p_{tr}(x) dx. \tag{36}$$


Then the KLIEP optimization problem is given by replacing the expectations in Eqs. (35) and (36)
with empirical averages as follows:

$$\max_{\{\alpha_\ell\}_{\ell=1}^{b}} \left[ \sum_{j=1}^{n_{te}} \log\left( \sum_{\ell=1}^{b} \alpha_\ell \phi_\ell(x_j^{te}) \right) \right]$$
$$\text{subject to } \sum_{\ell=1}^{b} \alpha_\ell \left( \sum_{i=1}^{n_{tr}} \phi_\ell(x_i^{tr}) \right) = n_{tr} \text{ and } \alpha_1, \alpha_2, \ldots, \alpha_b \ge 0.$$

This is a convex optimization problem and the global solution, which tends to be sparse (Boyd and
Vandenberghe, 2004), can be obtained, for example, by simply performing gradient ascent and
feasibility satisfaction iteratively. Model selection of KLIEP is possible by LCV.
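The alternation of gradient ascent and feasibility satisfaction mentioned above can be sketched as follows for one-dimensional inputs with Gaussian basis functions. This is a rough illustration under stated assumptions: the step size, the iteration count, the particular projection used in `make_feasible`, and all function names are choices made here rather than details of the published KLIEP implementation.

```python
import math
import random


def gauss_kernel(x, c, sigma):
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def kliep_fit(x_tr, x_te, centers, sigma, step=1e-3, n_iter=200):
    """Return (alpha, a), where the equality constraint reads a^T alpha = n_tr."""
    b, n_tr = len(centers), len(x_tr)
    phi_te = [[gauss_kernel(x, c, sigma) for c in centers] for x in x_te]
    a = [sum(gauss_kernel(x, c, sigma) for x in x_tr) for c in centers]
    a_sq = sum(v * v for v in a)

    def make_feasible(alpha):
        gap = n_tr - sum(v * w for v, w in zip(a, alpha))
        alpha = [w + gap * v / a_sq for w, v in zip(alpha, a)]   # meet the equality
        alpha = [max(0.0, w) for w in alpha]                     # enforce alpha >= 0
        scale = n_tr / sum(v * w for v, w in zip(a, alpha))
        return [w * scale for w in alpha]                        # restore the equality

    alpha = make_feasible([1.0] * b)
    for _ in range(n_iter):
        grad = [0.0] * b
        for p in phi_te:
            # w_hat(x_j^te), floored to keep the division well defined
            w_hat = max(sum(pl * al for pl, al in zip(p, alpha)), 1e-12)
            for l in range(b):
                grad[l] += p[l] / w_hat      # gradient of sum_j log w_hat(x_j^te)
        alpha = make_feasible([al + step * g for al, g in zip(alpha, grad)])
    return alpha, a


random.seed(0)
x_tr = [random.gauss(1.0, 0.5) for _ in range(100)]
x_te = [random.gauss(2.0, 0.25) for _ in range(100)]
centers = random.sample(x_te, 20)
alpha, a = kliep_fit(x_tr, x_te, centers, sigma=0.3)
```

Each iteration ends with a feasible parameter vector, so the loop can be stopped at any point and still return a valid, normalized importance model.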

Properties of KLIEP-type algorithms have been theoretically investigated in Nguyen et al. (2008)
and Sugiyama et al. (2008b) (see also Qin, 1998; Cheng and Chu, 2004). Note that the importance
model of KLIEP is the linear model (1), while that of LogReg is the log-linear model (34). A variant
of KLIEP for log-linear models has been studied in Tsuboi et al. (2008).


5.5  Discussions 

Table  1  summarizes  properties  of  proposed  and  existing  methods. 

KDE is efficient in computation since no optimization is involved, and model selection is possible
by LCV. However, KDE may suffer from the curse of dimensionality due to the difficulty of
density estimation in high dimensions.
Methods   Density estimation   Model selection   Optimization               Out-of-sample prediction
KDE       Necessary            Available         Analytic                   Possible
KMM       Not necessary        Not available     Convex quadratic program   Not possible
LogReg    Not necessary        Available         Convex non-linear          Possible
KLIEP     Not necessary        Available         Convex non-linear          Possible
LSIF      Not necessary        Available         Convex quadratic program   Possible
uLSIF     Not necessary        Available         Analytic                   Possible

Table 1: Relation between proposed and existing methods.


KMM can potentially overcome the curse of dimensionality by directly estimating the importance.
However, there is no objective model selection method. Therefore, model parameters such as
the Gaussian width need to be determined by hand, which is highly unreliable unless we have strong
prior knowledge. Furthermore, the computation of KMM is rather demanding since a quadratic
programming problem has to be solved.

LogReg and KLIEP also do not involve density estimation, but unlike KMM, they give
an estimate of the entire importance function, not only the values of the importance at training points.
Therefore, the values of the importance at unseen points can be estimated by LogReg and KLIEP.
This feature is highly useful since it enables us to employ CV for model selection, which is a significant
advantage over KMM. However, LogReg and KLIEP are computationally rather expensive
since non-linear optimization problems have to be solved. Note that the LogReg method is slightly
different in motivation from other methods, but has some similarity in computation and implementation;
for example, the LogReg method also involves a kernel smoother.

The  proposed  LSIF  method  is  qualitatively  similar  to  LogReg  and  KLIEP,  that  is,  it  can  avoid 
density  estimation,  model  selection  is  possible,  and  non-linear  optimization  is  involved.  LSIF  is 
advantageous  over  LogReg  and  KLIEP  in  that  it  is  equipped  with  a  regularization  path  tracking 
algorithm.  Thanks  to  this,  model  selection  of  LSIF  is  computationally  much  more  efficient  than 
LogReg  and  KLIEP.  However,  the  regularization  path  tracking  algorithm  tends  to  be  numerically 
unstable. 

The proposed uLSIF method inherits good properties of existing methods, such as not requiring
density estimation and being equipped with a built-in model selection method. In addition to these preferable
properties, the solution of uLSIF can be computed in an efficient and numerically stable manner.
Furthermore, thanks to the availability of the closed-form solution of uLSIF, the LOOCV score can
be analytically computed without repeating hold-out loops, which greatly reduces
the computation time in the model selection phase.

In  the  next  section,  we  experimentally  show  that  uLSIF  is  computationally  more  efficient  than 
existing  direct  importance  estimation  methods,  while  its  estimation  accuracy  is  comparable  to  the 
best  existing  methods. 

6.  Experiments 

In  this  section,  we  compare  the  experimental  performance  of  the  proposed  and  existing  methods. 
6.1 Importance Estimation

Let the dimension of the domain be $d$ and

$$p_{tr}(x) = \mathcal{N}(x; (0, 0, \ldots, 0)^\top, I_d),$$
$$p_{te}(x) = \mathcal{N}(x; (1, 0, \ldots, 0)^\top, I_d).$$

The task is to estimate the importance at training points:

$$w_i = w(x_i^{tr}) = \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \quad \text{for } i = 1, 2, \ldots, n_{tr}.$$
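For this pair of densities the true importance has a simple closed form: the quadratic terms cancel in the log ratio, leaving $w(x) = \exp(x_1 - 1/2)$, where $x_1$ is the first coordinate. The sketch below checks this against a direct evaluation of the two log densities; the function names are introduced here for illustration only.

```python
import math


def true_importance(x):
    """w(x) = p_te(x) / p_tr(x) for p_tr = N(0, I_d) and p_te = N(e_1, I_d)."""
    return math.exp(x[0] - 0.5)


def gaussian_log_pdf(x, mean):
    """Log density of N(mean, I_d) at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq - 0.5 * d * math.log(2.0 * math.pi)


x = [0.3, -1.2, 2.0]
direct = math.exp(gaussian_log_pdf(x, [1.0, 0.0, 0.0]) - gaussian_log_pdf(x, [0.0, 0.0, 0.0]))
```

The true importance therefore depends only on the first coordinate, regardless of the dimension $d$; the remaining coordinates act purely as nuisance dimensions for the estimators.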


We  compare  the  following  methods: 


KDE(CV): The Gaussian kernel (21) is used, where the kernel widths of the training and test
densities are separately optimized based on 5-fold LCV.

KMM(med): The performance of KMM is dependent on $B$, $\epsilon$, and $\sigma$. We set $B = 1000$ and
$\epsilon = (\sqrt{n_{tr}} - 1)/\sqrt{n_{tr}}$ following the original paper (Huang et al., 2007), and the Gaussian width $\sigma$ is
set at the median distance between samples within the training set and the test set (Schölkopf
and Smola, 2002; Song et al., 2007).

LogReg(CV): The Gaussian kernel model (22) is used as basis functions. The kernel width $\sigma$ and
the regularization parameter $\lambda$ are chosen based on 5-fold CV.1

KLIEP(CV): The Gaussian kernel model (22) is used. The kernel width $\sigma$ is selected based on
5-fold LCV.

uLSIF(CV): The Gaussian kernel model (22) is used. The kernel width $\sigma$ and the regularization
parameter $\lambda$ are determined based on LOOCV.
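The median-distance heuristic used for KMM(med) can be sketched as below. The text does not spell out exactly which pairs of samples enter the median, so pooling the training and test samples and taking all pairwise Euclidean distances is an assumption made here; the function name `median_distance` is likewise hypothetical.

```python
import math
from statistics import median


def median_distance(points):
    """Median of all pairwise Euclidean distances among the given points."""
    dists = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return median(dists)


# Pool training and test inputs (an assumption), then use the median as the width.
x_tr = [(0.0,), (1.0,)]
x_te = [(3.0,)]
sigma_med = median_distance(x_tr + x_te)
```

The computation needs all pairwise distances, so for large sample sizes a random subsample of pairs would be a natural shortcut.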

All the methods are implemented using the MATLAB® environment, where the CPLEX® optimizer
is used for solving quadratic programs in KMM and the LIBLINEAR implementation is used
for LogReg (Lin et al., 2007).

We fixed the number of test points at $n_{te} = 1000$ and consider the following two setups for the
number $n_{tr}$ of training samples and the input dimensionality $d$:

(a) $n_{tr}$ is fixed at $n_{tr} = 100$ and $d$ is changed as $d = 1, 2, \ldots, 20$,

(b) $d$ is fixed at $d = 10$ and $n_{tr}$ is changed as $n_{tr} = 50, 60, \ldots, 150$.


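We run the experiments 100 times for each $d$, each $n_{tr}$, and each method, and evaluate the quality of
the importance estimates $\{\hat{w}_i\}_{i=1}^{n_{tr}}$ by the normalized mean squared error (NMSE):

$$\mathrm{NMSE} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \frac{\hat{w}_i}{\sum_{i'=1}^{n_{tr}} \hat{w}_{i'}} - \frac{w_i}{\sum_{i'=1}^{n_{tr}} w_{i'}} \right)^2.$$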

1. In Sugiyama et al. (2008b), where KLIEP was proposed, the performance of LogReg was experimentally
investigated in the same setup. In that paper, however, LogReg was not regularized since KLIEP was not
regularized either. On the other hand, we use a regularized LogReg method and choose the regularization parameter in
addition to the Gaussian kernel width by CV here. Thanks to the regularization effect, the results of LogReg in the
current paper tend to be better than those reported in Sugiyama et al. (2008b).
In practice, the scale of the importance is not significant and the relative magnitude among $w_i$ is
important. Thus the above NMSE would be a suitable error metric for evaluating the performance
of each method.
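The NMSE can be computed in a couple of lines; the sketch below also shows its invariance to the overall scale of the estimates. The function name `nmse` is an illustrative choice.

```python
def nmse(w_est, w_true):
    """Normalized mean squared error between estimated and true importance values."""
    s_est, s_true = sum(w_est), sum(w_true)
    return sum((we / s_est - wt / s_true) ** 2
               for we, wt in zip(w_est, w_true)) / len(w_true)


error_uniform = nmse([1.0, 1.0], [1.0, 3.0])   # uniform estimate against unequal truth
error_scaled = nmse([2.0, 6.0], [1.0, 3.0])    # correct up to a constant factor
```

Because both vectors are normalized to sum to one before comparison, an estimate that is correct up to a constant factor receives zero error.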

NMSEs averaged over 100 trials (a) as a function of input dimensionality $d$ and (b) as a function
of the training sample size $n_{tr}$ are plotted in log scale in Figure 8. Error bars are omitted for clear
visibility; instead, the best method in terms of the mean error and comparable ones based on the
t-test at the significance level 1% are indicated by 'o', and the methods with significant difference from
the best methods are indicated by '×'.

Figure 8(a) shows that the error of KDE(CV) sharply increases as the input dimensionality
grows, while LogReg, KLIEP, and uLSIF tend to give much smaller errors than KDE. This would
be the fruit of directly estimating the importance without going through density estimation. KMM
tends to perform poorly, which is caused by an inappropriate choice of the Gaussian kernel width.
On the other hand, model selection in LogReg, KLIEP, and uLSIF seems to work quite well.
Figure 8(b) shows that the errors of all methods tend to decrease as the number of training samples
grows. Again LogReg, KLIEP, and uLSIF tend to give much smaller errors than KDE and KMM.

Next we investigate the computation time. Each method has a different model selection strategy,
that is, KMM does not involve CV, KDE and KLIEP involve CV over the kernel width, and LogReg
and uLSIF involve CV over both the kernel width and the regularization parameter. Thus a naive
comparison of the total computation time is not so meaningful. For this reason, we first investigate
the computation time of each importance estimation method after the model parameters are fixed.

The average CPU computation times over 100 trials are summarized in Figure 9. Figure 9(a)
shows that the computation time of KDE, KLIEP, and uLSIF is almost independent of the input
dimensionality, while that of KMM and LogReg is rather dependent on the input dimensionality.
Note that LogReg for $d < 3$ is slow due to some convergence problem of the LIBLINEAR package.
Among them, the proposed uLSIF is one of the fastest methods. Figure 9(b) shows that the computation
time of LogReg, KLIEP, and uLSIF is nearly independent of the number of training samples,
while that of KDE and KMM sharply increases as the number of training samples increases.

Both LogReg and uLSIF have high accuracy and their computation time after model selection
is comparable. Finally, we compare the entire computation time of LogReg and uLSIF including
CV, which is summarized in Figure 10. We note that the Gaussian width $\sigma$ and the regularization
parameter $\lambda$ are chosen over the 9×9 grid in this experiment for both LogReg and uLSIF. Therefore,
the comparison of the entire computation time is fair. Figures 10(a) and 10(b) show that uLSIF is
approximately 5 times faster than LogReg.

Overall,  uLSIF  is  shown  to  be  comparable  to  the  best  existing  method  (LogReg)  in  terms  of  the 
accuracy,  but  is  computationally  more  efficient  than  LogReg. 

6.2  Covariate  Shift  Adaptation  in  Regression  and  Classification 

Next, we illustrate how the importance estimation methods could be used in covariate shift adaptation
(Shimodaira, 2000; Zadrozny, 2004; Sugiyama and Müller, 2005; Huang et al., 2007; Bickel
and Scheffer, 2007; Bickel et al., 2007; Sugiyama et al., 2007). Covariate shift is a situation in
supervised learning where the input distributions change between the training and test phase but the
conditional distribution of outputs given inputs remains unchanged. Under covariate shift, standard
learning techniques such as maximum likelihood estimation or cross-validation are biased; the bias
caused by covariate shift can be asymptotically canceled by weighting the loss function according
to the importance.

Figure 8: NMSEs averaged over 100 trials in log scale for the artificial data set, (a) when input
dimensionality is changed (horizontal axis: $d$, input dimension) and (b) when training sample
size is changed (horizontal axis: $n$, number of training samples). Error bars are
omitted for clear visibility. Instead, the best method in terms of the mean error and
comparable ones based on the t-test at the significance level 1% are indicated by 'o'; the
methods with significant difference from the best methods are indicated by '×'.

Figure 9: Average computation time (after model selection) over 100 trials for the artificial data set,
(a) when input dimensionality is changed and (b) when training sample size is changed.

Figure 10: Average computation time over 100 trials for the artificial data set (including model
selection of the Gaussian width $\sigma$ and the regularization parameter $\lambda$ over the 9×9
grid). The compared methods are LogReg(CV) and uLSIF(CV).

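In addition to training input samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ drawn from a training input density $p_{\mathrm{tr}}(x)$ and test input samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ drawn from a test input density $p_{\mathrm{te}}(x)$, suppose that we are given training output samples $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ at the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$. The task is to predict the outputs for test inputs $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ based on the input-output training samples $\{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$.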

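We use the following kernel model for function learning:

$$\hat{f}(x;\theta) = \sum_{\ell=1}^{t} \theta_\ell K_h(x, m_\ell),$$

where $K_h(x,x')$ is the Gaussian kernel (21) and $m_\ell$ is a template point randomly chosen from $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ without replacement. We set the number of kernels at $t = 50$. We learn the parameter $\theta$ by importance weighted regularized least-squares (IWRLS) (Evgeniou et al., 2000; Sugiyama and Müller, 2005):

$$\hat{\theta}_{\mathrm{IWRLS}} = \mathop{\mathrm{argmin}}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \hat{w}(x_i^{\mathrm{tr}}) \left( \hat{f}(x_i^{\mathrm{tr}};\theta) - y_i^{\mathrm{tr}} \right)^2 + \gamma \|\theta\|^2 \right]. \tag{37}$$

It is known that IWRLS is consistent when the true importance $w(x_i^{\mathrm{tr}})$ is used as weights; unweighted RLS is not consistent due to covariate shift, given that the true learning target function $f(x)$ is not realizable by the model $\hat{f}(x)$ (Shimodaira, 2000).

The solution $\hat{\theta}_{\mathrm{IWRLS}}$ is analytically given by

$$\hat{\theta}_{\mathrm{IWRLS}} = (K^\top \hat{W} K + \gamma I_t)^{-1} K^\top \hat{W} y^{\mathrm{tr}},$$

where

$$K_{i,\ell} = K_h(x_i^{\mathrm{tr}}, m_\ell),$$
$$\hat{W} = \mathrm{diag}\left(\hat{w}(x_1^{\mathrm{tr}}), \hat{w}(x_2^{\mathrm{tr}}), \ldots, \hat{w}(x_{n_{\mathrm{tr}}}^{\mathrm{tr}})\right),$$
$$y^{\mathrm{tr}} = (y_1^{\mathrm{tr}}, y_2^{\mathrm{tr}}, \ldots, y_{n_{\mathrm{tr}}}^{\mathrm{tr}})^\top.$$

$\mathrm{diag}(a, b, \ldots, c)$ denotes the diagonal matrix with the diagonal elements $a, b, \ldots, c$.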
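The closed-form IWRLS solution can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' MATLAB implementation: the function names are hypothetical, and the Gaussian kernel is assumed to have the form exp(-||x - x'||^2 / (2 h^2)).

```python
import numpy as np

def gaussian_kernel(X, C, h):
    """Kernel matrix with entries exp(-||X[i] - C[l]||^2 / (2 h^2)) (assumed form)."""
    sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * h ** 2))

def iwrls_fit(X_tr, y_tr, w_tr, centers, h, gamma):
    """Solve (K^T W K + gamma I) theta = K^T W y for theta."""
    K = gaussian_kernel(X_tr, centers, h)      # n_tr x t design matrix
    KtW = K.T * w_tr                           # K^T W without forming diag(w)
    A = KtW @ K + gamma * np.eye(K.shape[1])
    return np.linalg.solve(A, KtW @ y_tr)

def iwrls_predict(X, centers, h, theta):
    """Evaluate the learned kernel model at the rows of X."""
    return gaussian_kernel(X, centers, h) @ theta
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice of this sketch; setting all weights to one recovers ordinary regularized least-squares (the 'Uniform' baseline).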

Data        Uniform      KDE          KMM          LogReg       KLIEP        uLSIF
                         (CV)         (med)        (CV)         (CV)         (CV)
kin-8fh     1.00(0.34)   1.22(0.52)   1.55(0.39)   1.31(0.39)   0.95(0.31)   1.02(0.33)
kin-8fm     1.00(0.39)   1.12(0.57)   1.84(0.58)   1.38(0.57)   0.86(0.35)   0.88(0.39)
kin-8nh     1.00(0.26)   1.09(0.20)   1.19(0.29)   1.09(0.19)   0.99(0.22)   1.02(0.18)
kin-8nm     1.00(0.30)   1.14(0.26)   1.20(0.20)   1.12(0.21)   0.97(0.25)   1.04(0.25)
abalone     1.00(0.50)   1.02(0.41)   0.91(0.38)   0.97(0.49)   0.94(0.67)   0.96(0.61)
image       1.00(0.51)   0.98(0.45)   1.08(0.54)   0.98(0.46)   0.94(0.44)   0.98(0.47)
ringnorm    1.00(0.04)   0.87(0.04)   0.87(0.04)   0.95(0.08)   0.99(0.06)   0.91(0.08)
twonorm     1.00(0.58)   1.16(0.71)   0.94(0.57)   0.91(0.61)   0.91(0.52)   0.88(0.57)
waveform    1.00(0.45)   1.05(0.47)   0.98(0.31)   0.93(0.32)   0.93(0.34)   0.92(0.32)
Average     1.00(0.38)   1.07(0.40)   1.17(0.37)   1.07(0.37)   0.94(0.35)   0.96(0.36)
Comp. time  -            0.82         3.50         3.27         2.23         1.00

Table 2: Mean test error averaged over 100 trials for covariate shift adaptation in regression and classification. The numbers in the brackets are the standard deviation. All the error values are normalized by that of 'Uniform' (uniform weighting, or equivalently no importance weighting). For each data set, the best method in terms of the mean error and comparable ones based on the Wilcoxon signed rank test at the significance level 1% are described in bold face. The upper half corresponds to regression data sets taken from DELVE (Rasmussen et al., 1996), while the lower half corresponds to classification data sets taken from IDA (Rätsch et al., 2001). All the methods are implemented using the MATLAB® environment, where the CPLEX® optimizer is used for solving quadratic programs in KMM and the LIBLINEAR implementation is used for LogReg (Lin et al., 2007).

The kernel width $h$ and the regularization parameter $\gamma$ in IWRLS (37) are chosen by importance weighted CV (IWCV) (Sugiyama et al., 2007). More specifically, we first divide the training samples $\{z_i^{\mathrm{tr}} \mid z_i^{\mathrm{tr}} = (x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$ into $R$ disjoint subsets $\{\mathcal{Z}_r^{\mathrm{tr}}\}_{r=1}^{R}$. Then a function $\hat{f}_r(x)$ is learned using $\{\mathcal{Z}_j^{\mathrm{tr}}\}_{j \neq r}$ by IWRLS and its mean test error for the remaining samples $\mathcal{Z}_r^{\mathrm{tr}}$ is computed:

$$\frac{1}{|\mathcal{Z}_r^{\mathrm{tr}}|} \sum_{(x,y) \in \mathcal{Z}_r^{\mathrm{tr}}} \hat{w}(x)\, \mathrm{loss}\left(\hat{f}_r(x), y\right),$$

where

$$\mathrm{loss}(\hat{y}, y) = \begin{cases} (\hat{y} - y)^2 & \text{(Regression)}, \\ \frac{1}{2}\left(1 - \mathrm{sign}\{\hat{y} y\}\right) & \text{(Classification)}. \end{cases}$$

We repeat this procedure for $r = 1, 2, \ldots, R$ and choose the kernel width $h$ and the regularization parameter $\gamma$ so that the average of the above mean test error over all $r$ is minimized. We set the number of folds in IWCV at $R = 5$. IWCV is shown to be an (almost) unbiased estimator of the generalization error, while unweighted CV with misspecified models is biased due to covariate shift (Zadrozny, 2004; Sugiyama et al., 2007).
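The IWCV procedure described in this section (R folds, importance-weighted loss on the held-out fold, averaged over folds) can be sketched as follows. This is an illustrative outline under stated assumptions: `fit` and `predict` are hypothetical callables supplied by the user (for example an IWRLS learner with fixed h and gamma), and the random fold assignment is a choice of this sketch.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Regression loss (y_hat - y)^2."""
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    """Classification loss (1 - sign(y_hat * y)) / 2 for labels in {-1, +1}."""
    return 0.5 * (1.0 - np.sign(y_hat * y))

def iwcv_score(X, y, w, fit, predict, loss, R=5, seed=0):
    """Importance-weighted R-fold cross-validation score.

    fit(X, y, w) returns a model; predict(model, X) returns predictions.
    Both are hypothetical interfaces, not part of the original paper.
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), R)
    errors = []
    for r in range(R):
        held = folds[r]
        rest = np.concatenate([folds[j] for j in range(R) if j != r])
        model = fit(X[rest], y[rest], w[rest])
        # Importance-weighted mean loss on the held-out subset.
        errors.append(np.mean(w[held] * loss(predict(model, X[held]), y[held])))
    return float(np.mean(errors))
```

Model selection then amounts to evaluating `iwcv_score` for each candidate pair (h, gamma) and keeping the pair with the smallest score.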

The data sets provided by DELVE (Rasmussen et al., 1996) and IDA (Rätsch et al., 2001) are used for performance evaluation. Each data set consists of input/output samples $\{(x_k, y_k)\}_{k=1}^{n}$. We normalize all the input samples $\{x_k\}_{k=1}^{n}$ into $[0,1]^d$ and choose the test samples $\{(x_j^{\mathrm{te}}, y_j^{\mathrm{te}})\}_{j=1}^{n_{\mathrm{te}}}$ from the pool $\{(x_k, y_k)\}_{k=1}^{n}$ as follows. We randomly choose one sample $(x_k, y_k)$ from the pool and accept this with probability $\min(1, 4(x_k^{(c)})^2)$, where $x_k^{(c)}$ is the $c$-th element of $x_k$ and $c$ is randomly determined and fixed in each trial of the experiments. Then we remove $x_k$ from the pool regardless of its rejection or acceptance, and repeat this procedure until $n_{\mathrm{te}}$ samples are accepted. We choose the training samples $\{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$ uniformly from the rest. Thus, in this experiment, the test input density tends to be lower than the training input density when $x_k^{(c)}$ is small. We set the number of samples at $n_{\mathrm{tr}} = 100$ and $n_{\mathrm{te}} = 500$ for all data sets. Note that we only use $\{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ for training regressors or classifiers; the test output values $\{y_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ are used only for evaluating the generalization performance.

We run the experiments 100 times for each data set and evaluate the mean test error:

$$\frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \mathrm{loss}\left(\hat{f}(x_j^{\mathrm{te}}), y_j^{\mathrm{te}}\right).$$

The results are summarized in Table 2, where 'Uniform' denotes uniform weights (or equivalently, no importance weight). The numbers in the brackets are the standard deviation. All the error values are normalized so that the mean error of Uniform is one. For each data set, the best method in terms of the mean error and comparable ones based on the Wilcoxon signed rank test at the significance level 1% are described in bold face. The upper half of the table corresponds to regression data sets taken from DELVE (Rasmussen et al., 1996), while the lower half corresponds to classification data sets taken from IDA (Rätsch et al., 2001). All the methods are implemented using the MATLAB® environment, where the CPLEX® optimizer is used for solving quadratic programs in KMM and the LIBLINEAR implementation is used for LogReg (Lin et al., 2007).

The table shows that the generalization performance of uLSIF tends to be better than that of Uniform, KDE, KMM, and LogReg, while it is comparable to the best existing method (KLIEP). The mean computation time over 100 trials is described in the bottom row of the table, where the value is normalized so that the computation time of uLSIF is one. This shows that the computation time of uLSIF is much shorter than that of KLIEP. Thus, uLSIF is overall shown to be useful in covariate shift adaptation.
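The biased selection of test samples used in this experiment (accept a randomly drawn sample with probability min(1, 4 (x^(c))^2), discard it from the pool either way, then draw the training set uniformly from what is left) can be sketched as below. The function name is hypothetical, and the sketch assumes that "the rest" means the samples never drawn from the pool; the inputs are assumed to be already normalized to [0,1]^d.

```python
import numpy as np

def split_under_covariate_shift(X, y, n_te, n_tr, rng):
    """Return ((X_tr, y_tr), (X_te, y_te), c) following the rejection scheme."""
    n, d = X.shape
    c = rng.integers(d)              # coordinate fixed for this trial
    order = rng.permutation(n)       # visiting a permutation = drawing without replacement
    test_idx, visited = [], 0
    for k in order:
        if len(test_idx) == n_te:
            break
        visited += 1                 # the sample leaves the pool whether accepted or not
        if rng.random() < min(1.0, 4.0 * X[k, c] ** 2):
            test_idx.append(k)
    if len(test_idx) < n_te:
        raise ValueError("pool exhausted before n_te samples were accepted")
    remaining = order[visited:]      # samples still in the pool (assumption of this sketch)
    if len(remaining) < n_tr:
        raise ValueError("not enough samples left for the training set")
    train_idx = rng.choice(remaining, size=n_tr, replace=False)
    te = np.array(test_idx)
    return (X[train_idx], y[train_idx]), (X[te], y[te]), c
```

Because small values of the selected coordinate are accepted less often, the test inputs concentrate on larger values of that coordinate than the training inputs do.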

6.3  Outlier  Detection 

Finally, we apply importance estimation methods to outlier detection.

Here, we consider an outlier detection problem of finding irregular samples in a data set (“evaluation data set”) based on another data set (“model data set”) that only contains regular samples. Defining the importance over two sets of samples, we can see that the importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one. Thus the importance values can be used as an index of the degree of outlyingness in this scenario. Since the evaluation data set has wider support than the model data set, we regard the evaluation data set as the training set $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ (that is, the denominator in the importance) and the model data set as the test set $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ (that is, the numerator in the importance). Then outliers tend to have smaller importance values (that is, close to zero).
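As a rough illustration of this scenario, the sketch below fits a uLSIF-style importance model with the evaluation set in the denominator and the model set in the numerator, and then uses the estimated importance as an inlier score. It is a simplified sketch, not the paper's implementation: the function names are hypothetical, the Gaussian basis centers are taken from the model set, and the kernel width and regularization parameter are fixed by hand rather than chosen by cross-validation.

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Matrix of Gaussian basis functions evaluated at the rows of X."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_importance(X_de, X_nu, centers, sigma, lam):
    """Return a function estimating w(x) = p_nu(x) / p_de(x).

    X_de: denominator samples (here, the evaluation set).
    X_nu: numerator samples (here, the model set).
    """
    Phi_de = gaussian_design(X_de, centers, sigma)
    Phi_nu = gaussian_design(X_nu, centers, sigma)
    H = Phi_de.T @ Phi_de / len(X_de)          # empirical second moment under p_de
    h = Phi_nu.mean(axis=0)                    # empirical mean under p_nu
    beta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    beta = np.maximum(beta, 0.0)               # round negative coefficients up to zero
    return lambda X: gaussian_design(X, centers, sigma) @ beta
```

Samples of the evaluation set whose estimated importance is close to zero are the outlier candidates; sorting the evaluation set by increasing importance gives an outlier ranking.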

We again test KMM(med), LogReg(CV), KLIEP(CV), and uLSIF(CV) for importance estimation; in addition, we include native outlier detection methods for comparison purposes. The outlier detection problem that the native methods used below solve is to find outliers in a single data set $\{x_k\}_{k=1}^{n}$; the native methods can be employed in the current scenario just by finding outliers from all samples:

$$\{x_k\}_{k=1}^{n} = \{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}} \cup \{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}.$$

One-class support vector machine (OSVM): The support vector machine (SVM) (Vapnik, 1998; Schölkopf and Smola, 2002) is one of the most successful classification algorithms in machine learning. The core idea of SVM is to separate samples in different classes by the maximum margin hyperplane in a kernel-induced feature space.

OSVM is an extension of SVM to outlier detection (Schölkopf et al., 2001). The basic idea of OSVM is to separate data samples $\{x_k\}_{k=1}^{n}$ into outliers and inliers by a hyperplane in a Gaussian reproducing kernel Hilbert space. More specifically, the solution of OSVM is given as the solution of the following convex quadratic programming problem:

$$\min_{\{w_k\}_{k=1}^{n}} \frac{1}{2} \sum_{k,k'=1}^{n} w_k w_{k'} K_\sigma(x_k, x_{k'})$$
$$\text{subject to } \sum_{k=1}^{n} w_k = 1 \text{ and } 0 \le w_1, w_2, \ldots, w_n \le \frac{1}{\nu n},$$

where $\nu$ ($0 < \nu \le 1$) is the maximum fraction of outliers.

We use the inverse distance of a sample from the separating hyperplane as an outlier score. The OSVM solution is dependent on the outlier ratio $\nu$ and the Gaussian kernel width $\sigma$, and there seems to be no systematic method to determine the values of these tuning parameters. Here we use the median distance between samples as the Gaussian width, which is a popular heuristic (Schölkopf and Smola, 2002; Song et al., 2007). The value of $\nu$ is fixed at the true outlier ratio, that is, the ideal optimal value. Thus the simulation results below should be slightly in favor of OSVM.

Local outlier factor (LOF): LOF is the score to detect a local outlier which lies relatively far from the nearest dense region (Breunig et al., 2000). For a prefixed natural number $k$, the LOF value of a sample $x$ is defined by

$$\mathrm{LOF}_k(x) = \frac{1}{k} \sum_{i=1}^{k} \frac{\mathrm{imd}_k(\mathrm{nearest}_i(x))}{\mathrm{imd}_k(x)},$$

where $\mathrm{nearest}_i(x)$ denotes the $i$-th nearest neighbor of $x$ and $\mathrm{imd}_k(x)$ denotes the inverse mean distance from $x$ to its $k$ nearest neighbors:

$$\mathrm{imd}_k(x) = \frac{1}{\frac{1}{k} \sum_{i=1}^{k} \|x - \mathrm{nearest}_i(x)\|}.$$

If $x$ alone is apart from a cloud of points, $\mathrm{imd}_k(x)$ tends to become smaller than $\mathrm{imd}_k(\mathrm{nearest}_i(x))$ for all $i$. Then the LOF value gets large and therefore such a point is regarded as an outlier.

The performance of LOF depends on the choice of the parameter $k$ and there seems to be no systematic way to find an appropriate value of $k$. Here we test several different values of $k$.
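The LOF score as defined above (a ratio of inverse mean distances, which is simpler than the reachability-based definition often found in software packages) can be computed with a brute-force neighbor search. The following is a minimal sketch with a hypothetical function name; it assumes there are no duplicate points, since a zero mean distance would cause a division by zero.

```python
import numpy as np

def lof_scores(X, k):
    """LOF_k for every row of X, using the inverse-mean-distance definition."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]           # indices of the k nearest neighbors
    mean_dist = np.take_along_axis(D, nn, axis=1).mean(axis=1)
    imd = 1.0 / mean_dist                       # inverse mean distance imd_k(x)
    return imd[nn].mean(axis=1) / imd           # average neighbor imd over own imd
```

The full pairwise distance matrix costs O(n^2) memory and time, which reflects why nearest neighbor search dominates the cost of LOF on larger data sets.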

Kernel density estimator (KDE’): A naive density estimation of all data samples $\{x_k\}_{k=1}^{n}$ can also be used for outlier detection since the density value itself could be regarded as an outlier score. We use KDE with the Gaussian kernel (21) for density estimation, where the kernel width is determined based on 5-fold LCV.

All the methods are implemented using the R environment: we use the ksvm routine in the kernlab package for OSVM (Karatzoglou et al., 2004) and the lofactor routine in the dprep package for LOF (Fernandez, 2005).

The data sets provided by IDA (Rätsch et al., 2001) are used for performance evaluation. These data sets are binary classification data sets consisting of positive/negative and training/test samples. We allocate all positive training samples for the “model” set, while all positive test samples and a fraction $\rho$ (= 0.01, 0.02, 0.05) of negative test samples are assigned to the “evaluation” set. Thus, we regard the positive samples as regular and the negative samples as irregular.

In the evaluation of the performance of outlier detection methods, it is important to take into account both the detection rate (the amount of true outliers an outlier detection algorithm can find) and the detection accuracy (the amount of true inliers that an outlier detection algorithm misjudges as outliers). Since there is a trade-off between the detection rate and the detection accuracy, we adopt the area under the ROC curve (AUC) as our error metric (Bradley, 1997).
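For reference, the AUC of an outlier score can be computed directly as the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half. The sketch below uses a hypothetical function name and assumes that larger scores mean "more outlying"; importance values, for which outliers are close to zero, would therefore be negated before being passed in.

```python
import numpy as np

def auc(scores, is_outlier):
    """Area under the ROC curve via pairwise comparison of outliers and inliers."""
    s = np.asarray(scores, dtype=float)
    lab = np.asarray(is_outlier, dtype=bool)
    pos, neg = s[lab], s[~lab]
    diff = pos[:, None] - neg[None, :]          # all outlier/inlier score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())
```

An AUC of 0.5 corresponds to a random ranking, and values below 0.5 indicate a score that ranks inliers above outliers more often than not.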

The mean AUC values over 20 trials as well as the computation time are summarized in Table 3, showing that uLSIF works fairly well. KLIEP works slightly better than uLSIF, but uLSIF is computationally much more efficient. LogReg overall works reasonably well, but it performs poorly for some data sets and the average AUC performance is not as good as uLSIF or KLIEP. KMM and OSVM are not comparable to uLSIF in AUC and they are computationally inefficient. Note that we also tested KMM and OSVM with several different Gaussian widths and experimentally found that the heuristic of using the median sample distance as the Gaussian kernel width works reasonably well in this experiment. Thus the AUC values of KMM and OSVM are close to optimal. LOF with large $k$ is shown to work well, although it is not clear whether the heuristic of simply using large $k$ is always appropriate or not. The computational cost of LOF is high since nearest neighbor search is computationally expensive. KDE’ works reasonably well, but its performance is not as good as uLSIF and KLIEP.

Overall,  uLSIF  is  shown  to  work  well  with  low  computational  costs. 

7.  Conclusions 

The importance is useful in various machine learning scenarios such as covariate shift adaptation and outlier detection. In this paper, we proposed a new method of importance estimation that can avoid solving a substantially more difficult task of density estimation. We formulated the importance estimation problem as least-squares function fitting and cast the optimization problem as a convex quadratic program (we referred to it as LSIF). We theoretically elucidated the convergence property of LSIF and showed that the entire regularization path of LSIF can be efficiently computed based on a parametric optimization technique. We further developed an approximation algorithm (we called it uLSIF), which allows us to obtain the closed-form solution. We showed that the leave-one-out cross-validation score can be computed analytically for uLSIF; this makes the computation of uLSIF highly efficient. We carried out extensive simulations in covariate shift adaptation and outlier detection, and experimentally confirmed that the proposed uLSIF is computationally more efficient than existing approaches, while the accuracy of uLSIF is comparable to the best existing methods. Thanks to the low computational cost, uLSIF would be highly scalable to large data sets, which is very important in practical applications.

We have given convergence proofs for LSIF and uLSIF. A possible future direction to pursue along this line is to show the convergence of LSIF and uLSIF in non-parametric cases, for example, following Nguyen et al. (2008) and Sugiyama et al. (2008b). We are currently exploring various possible applications of importance estimation methods beyond covariate shift adaptation or outlier detection, for example, feature selection, conditional distribution estimation, independent component analysis, and dimensionality reduction; we believe that importance estimation could be used as a new versatile tool in statistical machine learning.

Data       ρ      uLSIF   KLIEP   LogReg  KMM     OSVM    LOF     LOF     LOF     KDE’
                  (CV)    (CV)    (CV)    (med)   (med)   k=5     k=30    k=50    (CV)
banana     0.01   0.851   0.815   0.447   0.578   0.360   0.838   0.915   0.919   0.934
           0.02   0.858   0.824   0.428   0.644   0.412   0.813   0.918   0.920   0.927
           0.05   0.869   0.851   0.435   0.761   0.467   0.786   0.907   0.909   0.923
b-cancer   0.01   0.463   0.480   0.627   0.576   0.508   0.546   0.488   0.463   0.400
           0.02   0.463   0.480   0.627   0.576   0.506   0.521   0.445   0.428   0.400
           0.05   0.463   0.480   0.627   0.576   0.498   0.549   0.480   0.452   0.400
diabetes   0.01   0.558   0.615   0.599   0.574   0.563   0.513   0.403   0.390   0.425
           0.02   0.558   0.615   0.599   0.574   0.563   0.526   0.453   0.434   0.425
           0.05   0.532   0.590   0.636   0.547   0.545   0.536   0.461   0.447   0.435
f-solar    0.01   0.416   0.485   0.438   0.494   0.522   0.480   0.441   0.385   0.378
           0.02   0.426   0.456   0.432   0.480   0.550   0.442   0.406   0.343   0.374
           0.05   0.442   0.479   0.432   0.532   0.576   0.455   0.417   0.370   0.346
german     0.01   0.574   0.572   0.556   0.529   0.535   0.526   0.559   0.552   0.561
           0.02   0.574   0.572   0.556   0.529   0.535   0.553   0.549   0.544   0.561
           0.05   0.564   0.555   0.540   0.532   0.530   0.548   0.571   0.555   0.547
heart      0.01   0.659   0.647   0.833   0.623   0.681   0.407   0.659   0.739   0.638
           0.02   0.659   0.647   0.833   0.623   0.678   0.428   0.668   0.746   0.638
           0.05   0.659   0.647   0.833   0.623   0.681   0.440   0.666   0.749   0.638
satimage   0.01   0.812   0.828   0.600   0.813   0.540   0.909   0.930   0.896   0.916
           0.02   0.829   0.847   0.632   0.861   0.548   0.785   0.919   0.880   0.898
           0.05   0.841   0.858   0.715   0.893   0.536   0.712   0.895   0.868   0.892
splice     0.01   0.713   0.748   0.368   0.541   0.737   0.765   0.778   0.768   0.845
           0.02   0.754   0.765   0.343   0.588   0.744   0.761   0.793   0.783   0.848
           0.05   0.734   0.764   0.377   0.643   0.723   0.764   0.785   0.777   0.849
thyroid    0.01   0.534   0.720   0.745   0.681   0.504   0.259   0.111   0.071   0.256
           0.02   0.534   0.720   0.745   0.681   0.505   0.259   0.111   0.071   0.256
           0.05   0.534   0.720   0.745   0.681   0.485   0.259   0.111   0.071   0.256
titanic    0.01   0.525   0.534   0.602   0.502   0.456   0.520   0.525   0.525   0.461
           0.02   0.496   0.498   0.659   0.513   0.526   0.492   0.503   0.503   0.472
           0.05   0.526   0.521   0.644   0.538   0.505   0.499   0.512   0.512   0.433
twonorm    0.01   0.905   0.902   0.161   0.439   0.846   0.812   0.889   0.897   0.875
           0.02   0.896   0.889   0.197   0.572   0.821   0.803   0.892   0.901   0.858
           0.05   0.905   0.903   0.396   0.754   0.781   0.765   0.858   0.874   0.807
waveform   0.01   0.890   0.881   0.243   0.477   0.861   0.724   0.887   0.889   0.861
           0.02   0.901   0.890   0.181   0.602   0.817   0.690   0.887   0.890   0.861
           0.05   0.885   0.873   0.236   0.757   0.798   0.705   0.847   0.874   0.831
Average           0.661   0.685   0.530   0.608   0.596   0.594   0.629   0.622   0.623
Comp. time        1.00    11.7    5.35    751     12.4            85.5            8.70

Table 3: Mean AUC values for outlier detection over 20 trials for the benchmark data sets. All the methods are implemented using the R environment, where quadratic programs in KMM are solved by the ipop optimizer (Karatzoglou et al., 2004), the ksvm routine is used for OSVM (Karatzoglou et al., 2004), and the lofactor routine is used for LOF (Fernandez, 2005).

Appendix A. Existence of the Inverse Matrix of G

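Here we prove Lemma 1.

Let us consider the following system of linear equations:

$$\begin{pmatrix} H & -E^\top \\ -E & O_{|\mathcal{A}| \times |\mathcal{A}|} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0_b \\ 0_{|\mathcal{A}|} \end{pmatrix}, \tag{38}$$

where $x$ and $y$ are $b$- and $|\mathcal{A}|$-dimensional vectors, respectively. From the upper half of Eq. (38), we have

$$x = H^{-1} E^\top y.$$

Substituting this into the lower half of Eq. (38), we have

$$E H^{-1} E^\top y = 0_{|\mathcal{A}|}.$$

From the definition, the rank of the matrix $E$ is $|\mathcal{A}|$, that is, $E$ is a row-full rank matrix. As a result, the matrix $E H^{-1} E^\top$ is invertible. Therefore, Eq. (38) has the unique solution $x = 0_b$ and $y = 0_{|\mathcal{A}|}$. This implies that $G$ is invertible.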

Appendix  B.  Active  Set  of  LSIF 

Here,  we  prove  Theorem  2. 

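We prove that the active set $\mathcal{A}$ does not change under the infinitesimal shift of $H$ and $h$ if the strict complementarity condition is satisfied. We regard the pair of a symmetric matrix and a vector $(H', h')$ as an element in the $\left(\frac{b(b+1)}{2} + b\right)$-dimensional Euclidean space. We consider the following linear equation:

$$\begin{pmatrix} \alpha' \\ \xi' \end{pmatrix} = \begin{pmatrix} H' & -E^\top \\ -E & O_{|\mathcal{A}| \times |\mathcal{A}|} \end{pmatrix}^{-1} \begin{pmatrix} h' - \lambda 1_b \\ 0_{|\mathcal{A}|} \end{pmatrix},$$

where $E$ is the $|\mathcal{A}| \times b$ indicator matrix determined from the active set $\mathcal{A}$ (see Section 2.3 for the detailed definition). If $H' = H$ and $h' = h$ hold, the solution $(\alpha', \xi') = (\alpha^*(\lambda), \xi^*(\lambda))$ satisfies

$$\alpha'_\ell = 0, \quad \xi'_\ell > 0, \quad \forall \ell \in \mathcal{A},$$
$$\alpha'_\ell > 0, \quad \xi'_\ell = 0, \quad \forall \ell \notin \mathcal{A}, \tag{39}$$

because of the strict complementarity condition. On the other hand, if the norm of $(H', h') - (H, h)$ is infinitesimal, the solution $(\alpha', \xi')$ also satisfies Eq. (39) because of the continuity of the relation between $(H', h')$ and $(\alpha', \xi')$.

As a result, there exists an $\varepsilon$-ball $B_\varepsilon$ in $\mathbb{R}^{\frac{b(b+1)}{2} + b}$ such that the equality $\mathcal{A} = \{\ell \mid \alpha'_\ell = 0\}$ holds for any $(H', h') \in B_\varepsilon$. Therefore, we have $P(\hat{\mathcal{A}} \neq \mathcal{A}) \le P((\hat{H}, \hat{h}) \notin B_\varepsilon)$. Due to the large deviation principle (Dembo and Zeitouni, 1998), there is a positive constant $c$ such that

$$-\frac{1}{\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}} \log P((\hat{H}, \hat{h}) \notin B_\varepsilon) > c > 0,$$

if $\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}$ is large enough. Thus, asymptotically $P(\hat{\mathcal{A}} \neq \mathcal{A}) < e^{-c \min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}}$ holds.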


Appendix  C.  Learning  Curve  of  LSIF 


Here,  we  prove  Theorem  3. 

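Let us consider the ideal problem (7). Let $\alpha^*(\lambda)$ and $\xi^*(\lambda)$ be the optimal parameter and Lagrange multiplier (that is, the KKT conditions are fulfilled; see Section 2.3) and let $\tilde{\xi}^*(\lambda)$ be the vector of non-zero elements of $\xi^*(\lambda)$ defined in the same way as Eq. (11). Then $\alpha^*(\lambda)$ and $\tilde{\xi}^*(\lambda)$ satisfy

$$G \begin{pmatrix} \alpha^*(\lambda) \\ \tilde{\xi}^*(\lambda) \end{pmatrix} = \begin{pmatrix} h - \lambda 1_b \\ 0_{|\mathcal{A}|} \end{pmatrix}, \tag{40}$$

where

$$G = \begin{pmatrix} H & -E^\top \\ -E & O_{|\mathcal{A}| \times |\mathcal{A}|} \end{pmatrix}.$$

From the central limit theorem and the assumption (18), we have

$$\hat{h} = h + o_p\left(\frac{1}{n_{\mathrm{tr}}}\right), \tag{41}$$

where $O_p$ and $o_p$ denote the asymptotic order in probability. The assumption (a) implies that the equality

$$\hat{E} = E \tag{42}$$

holds with exponentially high probability due to Theorem 2. Note that $\hat{G}$ is the same size as $G$ if $\hat{E} = E$. Thus we have

$$\hat{G} = G + \delta G, \tag{43}$$

where

$$\delta G = \begin{pmatrix} \delta H & O_{b \times |\mathcal{A}|} \\ O_{|\mathcal{A}| \times b} & O_{|\mathcal{A}| \times |\mathcal{A}|} \end{pmatrix}, \qquad \delta H = \hat{H} - H.$$

Combining Eqs. (12), (40), (41), and (42), we have

$$\begin{pmatrix} \hat{\alpha}(\lambda) \\ \hat{\tilde{\xi}}(\lambda) \end{pmatrix} = \hat{G}^{-1} G \begin{pmatrix} \alpha^*(\lambda) \\ \tilde{\xi}^*(\lambda) \end{pmatrix} + o_p\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{44}$$

The matrix Taylor expansion (Petersen and Pedersen, 2007) yields

$$\hat{G}^{-1} = G^{-1} - G^{-1} \delta G\, G^{-1} + G^{-1} \delta G\, G^{-1} \delta G\, G^{-1} - \cdots, \tag{45}$$

and the central limit theorem asserts that

$$\delta H = O_p\left(\frac{1}{\sqrt{n_{\mathrm{tr}}}}\right). \tag{46}$$

Combining Eqs. (44), (45), (14), and (46), we have

$$\delta\alpha = \hat{\alpha}(\lambda) - \alpha^*(\lambda) \tag{47}$$
$$= -A\, \delta H\, \alpha^*(\lambda) + A\, \delta H\, A\, \delta H\, \alpha^*(\lambda) + o_p\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{48}$$

Through direct calculation, we can confirm that

$$A H A = A. \tag{49}$$

Similar to Eq. (15), it holds that

$$\alpha^*(\lambda) = A (h - \lambda 1_b). \tag{50}$$

From Eqs. (49) and (50), we have

$$A (H \alpha^*(\lambda) - h) = -\lambda A 1_b. \tag{51}$$

Eqs. (43), (4), and (3) imply

$$\mathbb{E}[\delta H] = O_{b \times b}. \tag{52}$$

From Eqs. (2) and (47), we have

$$J(\hat{\alpha}(\lambda)) = J(\alpha^*(\lambda)) + \frac{1}{2} \delta\alpha^\top H\, \delta\alpha + (H \alpha^*(\lambda) - h)^\top \delta\alpha. \tag{53}$$

From Eqs. (46), (48), and (49), we have

$$\mathbb{E}\left[\delta\alpha^\top H\, \delta\alpha\right] = \mathrm{tr}\left(H\, \mathbb{E}\left[\delta\alpha\, \delta\alpha^\top\right]\right)$$
$$= \mathrm{tr}\left(A H A\, \mathbb{E}\left[(\delta H \alpha^*(\lambda)) (\delta H \alpha^*(\lambda))^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right)$$
$$= \mathrm{tr}\left(A\, \mathbb{E}\left[(\delta H \alpha^*(\lambda)) (\delta H \alpha^*(\lambda))^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{54}$$

From Eqs. (48), (51), and (52), we have

$$\mathbb{E}\left[\delta\alpha^\top (H \alpha^*(\lambda) - h)\right] = -\mathbb{E}\left[(\delta H \alpha^*(\lambda) - \delta H A\, \delta H \alpha^*(\lambda))^\top A (H \alpha^*(\lambda) - h)\right] + o\left(\frac{1}{n_{\mathrm{tr}}}\right)$$
$$= \mathbb{E}\left[(\delta H \alpha^*(\lambda) - \delta H A\, \delta H \alpha^*(\lambda))^\top \lambda A 1_b\right] + o\left(\frac{1}{n_{\mathrm{tr}}}\right)$$
$$= -\lambda\, \mathrm{tr}\left(A\, \mathbb{E}\left[(\delta H \alpha^*(\lambda)) (\delta H A 1_b)^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{55}$$

Combining Eqs. (53), (54), and (55), we have

$$\mathbb{E}[J(\hat{\alpha}(\lambda))] = J(\alpha^*(\lambda)) + \frac{1}{2 n_{\mathrm{tr}}} \mathrm{tr}\left(A\, \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda))^\top\right]\right)$$
$$\qquad - \frac{\lambda}{n_{\mathrm{tr}}} \mathrm{tr}\left(A\, \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H A 1_b)^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right).$$

According to the central limit theorem, $\sqrt{n_{\mathrm{tr}}}\, \delta H_{\ell,\ell'}$ asymptotically follows the normal distribution with mean zero and variance

$$\int \varphi_\ell^2(x) \varphi_{\ell'}^2(x) p_{\mathrm{tr}}(x)\, dx - H_{\ell,\ell'}^2,$$

and the asymptotic covariance between $\sqrt{n_{\mathrm{tr}}}\, \delta H_{\ell,\ell'}$ and $\sqrt{n_{\mathrm{tr}}}\, \delta H_{\ell'',\ell'''}$ is given by

$$\int \varphi_\ell(x) \varphi_{\ell'}(x) \varphi_{\ell''}(x) \varphi_{\ell'''}(x) p_{\mathrm{tr}}(x)\, dx - H_{\ell,\ell'} H_{\ell'',\ell'''}.$$

Then we have

$$\lim_{n_{\mathrm{tr}} \to \infty} \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda))^\top\right] = C_{w^*,w^*},$$
$$\lim_{n_{\mathrm{tr}} \to \infty} \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \alpha^*(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H A 1_b)^\top\right] = C_{w^*,v},$$

where $C_{w,w'}$ is the $b \times b$ covariance matrix with the $(\ell,\ell')$-th element being the covariance between $w(x)\varphi_\ell(x)$ and $w'(x)\varphi_{\ell'}(x)$ under $p_{\mathrm{tr}}(x)$. Then we obtain Eq. (19).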
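Several algebraic steps used in this appendix can be checked numerically on a small random instance. The sketch below is only a sanity check under stated assumptions: the matrix H, the vector h, the value of lambda, and the active set are arbitrary hypothetical choices, and A is taken to be the upper-left b x b block of the inverse of G. It verifies that A H A = A, that A(H alpha - h) = -lambda A 1_b for alpha = A(h - lambda 1_b), and that adding the second-order term of the matrix Taylor expansion improves on the first-order approximation of the perturbed inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
b, lam = 6, 0.3
M = rng.standard_normal((b, b))
H = M @ M.T + b * np.eye(b)                 # symmetric positive definite (hypothetical)
h = rng.random(b)
active = [1, 4]                             # hypothetical active set
E = np.eye(b)[active]                       # |A| x b indicator matrix
Z = np.zeros((len(active), len(active)))
G = np.block([[H, -E.T], [-E, Z]])
G_inv = np.linalg.inv(G)
A = G_inv[:b, :b]                           # upper-left block of G^{-1}
alpha = A @ (h - lam * np.ones(b))          # alpha = A (h - lambda 1_b)

# Perturb only the H block, mimicking delta G.
dH = 1e-4 * rng.standard_normal((b, b))
dH = (dH + dH.T) / 2.0
dG = np.zeros_like(G)
dG[:b, :b] = dH
exact = np.linalg.inv(G + dG)
first_order = G_inv - G_inv @ dG @ G_inv
second_order = first_order + G_inv @ dG @ G_inv @ dG @ G_inv
err1 = np.linalg.norm(exact - first_order)
err2 = np.linalg.norm(exact - second_order)
```

In this instance the entries of `alpha` on the active set are zero up to rounding error, as expected from the lower half of the linear system.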


Appendix  D.  Regularization  Path  of  LSIF 

Here, we derive the regularization path tracking algorithm given in Figure 1.

When $\lambda$ is greater than or equal to $\max_\ell \hat{h}_\ell$, the solution of the KKT conditions (9)-(10) is provided as $\hat{\alpha} = 0_b$, $\hat{\xi} = \lambda 1_b - \hat{h} \ge 0_b$. Therefore, the initial value $\lambda_0$ is $\max_\ell \hat{h}_\ell$, and the corresponding optimal solution is $\hat{\alpha}(\lambda_0) = 0_b$.

Since $\hat{\tilde{\xi}}(\lambda)$ corresponds to the non-zero elements of $\hat{\xi}(\lambda)$ as shown in Eq. (11), we have

$$\hat{\xi}_j(\lambda) = \begin{cases} \hat{\tilde{\xi}}_i(\lambda) & \text{if } j \text{ is the } i\text{-th element of the active set } \hat{\mathcal{A}}, \\ 0 & \text{otherwise}. \end{cases} \tag{56}$$

When $\lambda$ is decreased, the solutions $\hat{\alpha}(\lambda)$ and $\hat{\xi}(\lambda)$ still satisfy Eqs. (12) and (56) as long as the active set $\hat{\mathcal{A}}$ remains unchanged. Change points of the active set can be found by examining the non-negativity conditions of $\hat{\alpha}(\lambda)$ and $\hat{\xi}(\lambda)$ as follows. Suppose $\lambda$ is decreased and the non-negativity condition

$$\begin{pmatrix} \hat{\alpha}(\lambda) \\ \hat{\xi}(\lambda) \end{pmatrix} \ge 0_{2b}$$

is violated at $\lambda = \lambda'$. That is, both $\hat{\alpha}(\lambda') \ge 0_b$ and $\hat{\xi}(\lambda') \ge 0_b$ hold, and either $\hat{\alpha}(\lambda' - \epsilon) \ge 0_b$ or $\hat{\xi}(\lambda' - \epsilon) \ge 0_b$ is violated for any $\epsilon > 0$. If $\hat{\alpha}_j(\lambda') = 0$ for $j \notin \hat{\mathcal{A}}$, $j$ should be added to the active set $\hat{\mathcal{A}}$; on the other hand, if $\hat{\xi}_j(\lambda') = 0$ for some $j \in \hat{\mathcal{A}}$, $\hat{\alpha}_j(\lambda')$ will take a positive value and therefore $j$ should be removed from the active set $\hat{\mathcal{A}}$. Then, for the updated active set, we compute the solutions by Eqs. (12) and (56). Iterating this procedure until $\lambda$ reaches zero, we can obtain the entire regularization path.

Note that we omitted some minor exceptional cases for the sake of simplicity; treatments for all possible exceptions and the rigorous convergence property are exhaustively studied in Best (1982).


Appendix E. Negative Index Set of $\beta^\circ(\lambda)$


Here we prove Theorem 4.

As explained in Appendix B, we regard the pair of a symmetric matrix and a vector $(H', h')$ as an element in the $\left(\frac{b(b+1)}{2} + b\right)$-dimensional Euclidean space.

We consider the linear equation

$$\beta' = (H' + \lambda I_b)^{-1} h'.$$

Due to the assumption, for $H' = H$ and $h' = h$, we have

$$\beta'_\ell \neq 0, \quad \ell = 1, 2, \ldots, b. \tag{57}$$

On the other hand, if the norm of $(H', h') - (H, h)$ is infinitesimal, the solution $\beta'$ also satisfies Eq. (57), and the sign of $\beta'_\ell$ is the same as that of $\beta^\circ_\ell$ for $\ell = 1, 2, \ldots, b$ because of the continuity of the relation between $(H', h')$ and $\beta'$.

As a result, there exists an $\varepsilon$-ball $B_\varepsilon$ in $\mathbb{R}^{\frac{b(b+1)}{2} + b}$ such that the equality $\mathcal{B} = \{\ell \mid \beta'_\ell < 0\}$ holds for any $(H', h') \in B_\varepsilon$. Therefore, we have $P(\hat{\mathcal{B}} \neq \mathcal{B}) \le P((\hat{H}, \hat{h}) \notin B_\varepsilon)$. Due to the large deviation principle (Dembo and Zeitouni, 1998), there is a positive constant $c$ such that

$$-\frac{1}{\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}} \log P((\hat{H}, \hat{h}) \notin B_\varepsilon) > c > 0,$$

if $\min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}$ is large enough. Thus, asymptotically $P(\hat{\mathcal{B}} \neq \mathcal{B}) < e^{-c \min\{n_{\mathrm{tr}}, n_{\mathrm{te}}\}}$ holds.


Appendix  F.  Learning  Curve  of  uLSIF 

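Here, we prove Theorem 5.

Let

$$\hat{B}_\lambda = \hat{H} + \lambda I_b,$$

the empirical counterpart of $B_\lambda = H + \lambda I_b$. The matrix Taylor expansion (Petersen and Pedersen, 2007) yields

$$\hat{B}_\lambda^{-1} = B_\lambda^{-1} - B_\lambda^{-1} \delta H B_\lambda^{-1} + B_\lambda^{-1} \delta H B_\lambda^{-1} \delta H B_\lambda^{-1} - \cdots. \tag{58}$$

Let $\hat{\mathcal{B}} \subset \{1, 2, \ldots, b\}$ be the set of negative indices of $\tilde{\beta}(\lambda)$, that is,

$$\hat{\mathcal{B}} = \{\ell \mid \tilde{\beta}_\ell(\lambda) < 0,\ \ell = 1, 2, \ldots, b\}.$$

Let $\hat{D}$ be the $b$-dimensional diagonal matrix with the $\ell$-th diagonal element

$$\hat{D}_{\ell,\ell} = \begin{cases} 0 & \ell \in \hat{\mathcal{B}}, \\ 1 & \text{otherwise}. \end{cases}$$

The assumption (a) implies that the equality

$$\hat{D} = D \tag{59}$$

holds with exponentially high probability due to Theorem 4. Combining Eqs. (59), (41), (58), and (24), we have

$$\delta\beta = \hat{\beta}(\lambda) - \beta^*(\lambda)$$
$$= \hat{D} \hat{B}_\lambda^{-1} \hat{h} - D B_\lambda^{-1} h$$
$$= -D B_\lambda^{-1} \delta H \beta^\circ(\lambda) + D B_\lambda^{-1} \delta H B_\lambda^{-1} \delta H \beta^\circ(\lambda) + o_p\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{60}$$

From Eqs. (46) and (60), we have

$$\mathbb{E}\left[\delta\beta^\top H\, \delta\beta\right] = \mathrm{tr}\left(B_\lambda^{-1} D H D B_\lambda^{-1}\, \mathbb{E}\left[(\delta H \beta^\circ(\lambda)) (\delta H \beta^\circ(\lambda))^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{61}$$

From Eqs. (52) and (24), we have

$$\mathbb{E}\left[\delta\beta^\top (H \beta^*(\lambda) - h)\right] = \mathbb{E}\left[\left(-\delta H \beta^\circ(\lambda) + \delta H B_\lambda^{-1} \delta H \beta^\circ(\lambda)\right)^\top B_\lambda^{-1} D (H \beta^*(\lambda) - h)\right] + o\left(\frac{1}{n_{\mathrm{tr}}}\right)$$
$$= \mathrm{tr}\left(B_\lambda^{-1}\, \mathbb{E}\left[(\delta H \beta^\circ(\lambda)) \left(\delta H B_\lambda^{-1} D (H \beta^*(\lambda) - h)\right)^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right). \tag{62}$$

Combining Eqs. (53), (61), and (62), we have

$$\mathbb{E}[J(\hat{\beta}(\lambda))] = J(\beta^*(\lambda)) + \frac{1}{2 n_{\mathrm{tr}}} \mathrm{tr}\left(B_\lambda^{-1} D H D B_\lambda^{-1}\, \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda))^\top\right]\right)$$
$$\qquad + \frac{1}{n_{\mathrm{tr}}} \mathrm{tr}\left(B_\lambda^{-1}\, \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda)) \left(\sqrt{n_{\mathrm{tr}}}\, \delta H B_\lambda^{-1} D (H \beta^*(\lambda) - h)\right)^\top\right]\right) + o\left(\frac{1}{n_{\mathrm{tr}}}\right).$$

According to the central limit theorem, we have

$$\lim_{n_{\mathrm{tr}} \to \infty} \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda)) (\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda))^\top\right] = C_{w^\circ,w^\circ},$$
$$\lim_{n_{\mathrm{tr}} \to \infty} \mathbb{E}\left[(\sqrt{n_{\mathrm{tr}}}\, \delta H \beta^\circ(\lambda)) \left(\sqrt{n_{\mathrm{tr}}}\, \delta H B_\lambda^{-1} D (H \beta^*(\lambda) - h)\right)^\top\right] = C_{w^\circ,u}.$$

Then we obtain Eq. (25).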

Appendix  G.  ‘Norm’  Upper  Bound  of  Approximation  Error  for  uLSIF 

Here  we  prove  Theorem  6. 

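Using the weighted norm (27), we can express diff($\lambda$) as

$$\mathrm{diff}(\lambda) = \frac{\inf_{\lambda' \ge 0} \|\hat\alpha(\lambda') - \hat\beta(\lambda)\|_{\hat H}}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))}.$$

As shown in Appendix D, $\hat\alpha(\lambda') = 0_b$ holds for some large $\lambda'$. Then we immediately have

$$\mathrm{diff}(\lambda) \le \frac{\|\hat\beta(\lambda)\|_{\hat H}}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))},$$

which proves Eq. (28). Let $\kappa_{\max}$ be the largest eigenvalue of $\hat H$. Then $\|\hat\beta(\lambda)\|_{\hat H}$ can be upper bounded as

$$\|\hat\beta(\lambda)\|_{\hat H} \le \sqrt{\kappa_{\max}}\,\|\hat\beta(\lambda)\|_2 \le \sqrt{\kappa_{\max}}\,\|\tilde\beta(\lambda)\|_2,$$

where the first inequality may be confirmed by eigen-decomposing $\hat H$ and the second inequality is
clear from the definitions of $\hat\beta(\lambda)$ and $\tilde\beta(\lambda)$. Let $\kappa_{\min}$ be the smallest eigenvalue of $\hat H$. Then an upper
bound of $\|\tilde\beta(\lambda)\|_2$ is given as

$$\|\tilde\beta(\lambda)\|_2 = \|(\hat H + \lambda I_b)^{-1} \hat h\|_2 \le \frac{1}{\kappa_{\min} + \lambda}\,\|\hat h\|_2 \le \frac{1}{\lambda}\,\|\hat h\|_2,$$

where the last inequality follows from $\kappa_{\min} \ge 0$.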
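
The chain of norm bounds above can be verified on synthetic data. The sketch below assumes randomly generated basis values in (0, 1] (the arrays `Phi_tr` and `Phi_te` are hypothetical, not data from the paper) and checks each inequality in turn.

```python
import numpy as np

rng = np.random.default_rng(0)
b, lam = 6, 0.3
Phi_tr = rng.uniform(0.05, 1.0, size=(100, b))   # basis values at training points (made up)
Phi_te = rng.uniform(0.05, 1.0, size=(80, b))    # basis values at test points (made up)

H = Phi_tr.T @ Phi_tr / Phi_tr.shape[0]
h = Phi_te.mean(axis=0)

beta_tilde = np.linalg.solve(H + lam * np.eye(b), h)   # unrounded coefficients
beta_hat = np.maximum(beta_tilde, 0.0)                 # coefficients after rounding up to zero

eigenvalues = np.linalg.eigvalsh(H)                    # ascending order
kappa_min, kappa_max = eigenvalues[0], eigenvalues[-1]

norm_H = np.sqrt(beta_hat @ H @ beta_hat)              # weighted norm of beta_hat
norm2_hat = np.linalg.norm(beta_hat)
norm2_tilde = np.linalg.norm(beta_tilde)
norm2_h = np.linalg.norm(h)
```

The tolerance-free inequalities `norm_H <= sqrt(kappa_max) * norm2_hat <= sqrt(kappa_max) * norm2_tilde` and `norm2_tilde <= norm2_h / (kappa_min + lam) <= norm2_h / lam` mirror the two displays in the proof.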
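Now we have

$$\frac{\|\hat\beta(\lambda)\|_{\hat H}}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))} \le \frac{\frac{1}{\lambda}\sqrt{\kappa_{\max}}\,\|\hat h\|_2}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))} = \frac{\frac{1}{\lambda}\sqrt{\kappa_{\max}}\,\|\hat h\|_2}{\left(\sum_{i=1}^{n_{\mathrm{tr}}}\sum_{\ell=1}^{b} \varphi_\ell(x_i^{\mathrm{tr}})\,\hat\beta_\ell(\lambda)/\|\hat\beta(\lambda)\|_1\right) \times \|\hat\beta(\lambda)\|_1}.$$

For the denominator of the above expression, we have

$$\sum_{i=1}^{n_{\mathrm{tr}}}\sum_{\ell=1}^{b} \varphi_\ell(x_i^{\mathrm{tr}})\,\frac{\hat\beta_\ell(\lambda)}{\|\hat\beta(\lambda)\|_1} \ge \sum_{i=1}^{n_{\mathrm{tr}}} \min_{\ell'} \varphi_{\ell'}(x_i^{\mathrm{tr}}) \sum_{\ell=1}^{b} \frac{\hat\beta_\ell(\lambda)}{\|\hat\beta(\lambda)\|_1} = \sum_{i=1}^{n_{\mathrm{tr}}} \min_{\ell} \varphi_\ell(x_i^{\mathrm{tr}}),$$

where the last equality follows from the non-negativity of $\hat\beta_\ell(\lambda)$. The reciprocal of $\|\hat h\|_2/\|\hat\beta(\lambda)\|_1$ is
lower bounded as follows:

$$\frac{\|\hat\beta(\lambda)\|_1}{\|\hat h\|_2} \ge \max_\ell \frac{\hat\beta_\ell(\lambda)}{\|\hat h\|_2} = \max_\ell \frac{\tilde\beta_\ell(\lambda)}{\|\hat h\|_2},$$

where the last equality follows from the fact that there is an $\ell'$ such that $\tilde\beta_{\ell'}(\lambda) > 0$; otherwise, we
have $\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda)) = 0$, which contradicts the assumption of the theorem. Let us put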


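$$K e = \frac{\tilde\beta(\lambda)}{\|\hat h\|_2},$$

where $K > 0$ and $e \in \mathbb{R}^b$ such that $\|e\|_2 = 1$. Then we have

$$(\kappa_{\max} + \lambda)^{-1} \le K \quad \text{and} \quad e^\top \hat h \ge 0.$$

Note that there exists an $\ell$ such that $e_\ell > 0$. Then, we have

$$\max_\ell \frac{\tilde\beta_\ell(\lambda)}{\|\hat h\|_2} = \max_\ell K e_\ell = K \max_\ell e_\ell \ge \frac{1}{\kappa_{\max} + \lambda} \max_\ell e_\ell \ge \frac{1}{\kappa_{\max} + \lambda} \min_{e} \left\{ \max_\ell e_\ell \;\middle|\; e^\top e = 1,\ e^\top \hat h / \|\hat h\|_1 \ge 0 \right\}.$$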

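Now we prove the following lemma.

Lemma 8 Let $p_1, p_2, \ldots, p_b$ ($b \ge 2$) be positive numbers such that

$$\sum_{\ell=1}^{b} p_\ell = 1,$$

and let

$$\epsilon = \frac{1}{\sqrt{b}} \min_\ell \frac{p_\ell}{1 - p_\ell}.$$

Then, there exists no $e = (e_1, e_2, \ldots, e_b)^\top \in \mathbb{R}^b$ such that the three conditions,

$$\sum_{\ell=1}^{b} e_\ell^2 = 1, \quad \sum_{\ell=1}^{b} p_\ell e_\ell \ge 0, \quad \text{and} \quad e_\ell < \epsilon \ \text{for } \ell = 1, 2, \ldots, b,$$

are satisfied at the same time.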

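Proof We suppose that $e \in \mathbb{R}^b$ satisfies the three conditions. If $\min_\ell p_\ell/(1 - p_\ell) > 1$, we have
$p_\ell > 1/2$ for all $\ell$. However, this is contradictory to $\sum_{\ell=1}^{b} p_\ell = 1$. Therefore, we have

$$\min_\ell \frac{p_\ell}{1 - p_\ell} \le 1,$$

from which we have

$$\epsilon \le 1/\sqrt{b}.$$

The equality constraint $\sum_{\ell=1}^{b} e_\ell^2 = 1$ implies the condition that there exists an $e_i$ such that $|e_i| \ge 1/\sqrt{b}$.
Moreover, we have $e_1, e_2, \ldots, e_b < \epsilon \le 1/\sqrt{b}$, and thus there is an $e_i$ such that $e_i \le -1/\sqrt{b}$. Hence,
we have

$$\frac{p_i}{\sqrt{b}} \le -p_i e_i \le \sum_{\ell \ne i} p_\ell e_\ell < \sum_{\ell \ne i} p_\ell \frac{1}{\sqrt{b}} \min_k \frac{p_k}{1 - p_k} = (1 - p_i) \frac{1}{\sqrt{b}} \min_k \frac{p_k}{1 - p_k} \le \frac{p_i}{\sqrt{b}}.$$

This results in contradiction.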
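
Lemma 8 can be sanity-checked by random search. The sketch below is only an illustration, assuming a randomly drawn probability vector `p` (hypothetical) and random unit vectors `e`: among the vectors with a non-negative weighted sum, none should have all entries strictly below the threshold.

```python
import numpy as np

def lemma8_epsilon(p):
    """Threshold epsilon = (1 / sqrt(b)) * min_l p_l / (1 - p_l) for a probability vector p."""
    p = np.asarray(p, dtype=float)
    return np.min(p / (1.0 - p)) / np.sqrt(p.size)

def count_violations(p, n_trials=20000, seed=0):
    """Count random unit vectors e with p^T e >= 0 whose entries all lie strictly below epsilon."""
    rng = np.random.default_rng(seed)
    eps = lemma8_epsilon(p)
    e = rng.standard_normal((n_trials, p.size))
    e /= np.linalg.norm(e, axis=1, keepdims=True)   # project onto the unit sphere
    feasible = e @ p >= 0.0                         # second condition of the lemma
    all_below = e.max(axis=1) < eps                 # third condition of the lemma
    return int(np.sum(feasible & all_below))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))    # positive entries summing to one (illustrative)
violations = count_violations(p)
```

A random search cannot prove the lemma, but a non-zero count would immediately expose an error in the statement.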


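Let $p_\ell = \hat h_\ell / \|\hat h\|_1$ and apply Lemma 8. Note that any element of $\hat h$ is positive. Then, we have

$$\max_\ell \frac{\tilde\beta_\ell(\lambda)}{\|\hat h\|_2} \ge \frac{1}{\kappa_{\max} + \lambda} \cdot \frac{1}{\sqrt{b}} \min_\ell \frac{p_\ell}{\sum_{\ell' \ne \ell} p_{\ell'}}.$$

Moreover, we have

$$\min_\ell \frac{p_\ell}{\sum_{\ell' \ne \ell} p_{\ell'}} \ge \min_\ell \frac{\hat h_\ell}{\sum_{\ell'=1}^{b} \hat h_{\ell'}} = \frac{\min_\ell \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}})}{\sum_{\ell'=1}^{b} \sum_{j=1}^{n_{\mathrm{te}}} \varphi_{\ell'}(x_j^{\mathrm{te}})} \ge \frac{\min_\ell \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}})}{b\, n_{\mathrm{te}}},$$

where the last inequality follows from the assumption $0 < \varphi_\ell(x) \le 1$. Therefore, we have the
inequality

$$\frac{\frac{1}{\lambda}\sqrt{\kappa_{\max}}\,\|\hat h\|_2}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))} \le b\sqrt{b\,\kappa_{\max}} \left(1 + \frac{\kappa_{\max}}{\lambda}\right) \frac{1}{\sum_{i=1}^{n_{\mathrm{tr}}} \min_\ell \varphi_\ell(x_i^{\mathrm{tr}})} \cdot \frac{n_{\mathrm{te}}}{\min_\ell \sum_{j=1}^{n_{\mathrm{te}}} \varphi_\ell(x_j^{\mathrm{te}})}. \qquad (63)$$

An upper bound of $\kappa_{\max}$ is given as follows. For all $a \in \mathbb{R}^b$, the inequality

$$-\sum_{\ell=1}^{b} |a_\ell|\,\varphi_\ell(x) \le \sum_{\ell=1}^{b} a_\ell\,\varphi_\ell(x) \le \sum_{\ell=1}^{b} |a_\ell|\,\varphi_\ell(x) \qquad (64)$$

holds because of the positivity of $\varphi_\ell(x)$. Let us define $\bar a \in \mathbb{R}^b$ for given $a \in \mathbb{R}^b$ as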

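$$\bar a = (|a_1|, |a_2|, \ldots, |a_b|)^\top.$$

Note that $\|\bar a\|_2 = \|a\|_2$ holds. Then, using Eq. (64), we obtain the inequality

$$a^\top \hat H a = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \sum_{\ell=1}^{b} a_\ell\,\varphi_\ell(x_i^{\mathrm{tr}}) \right)^2 \le \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \sum_{\ell=1}^{b} |a_\ell|\,\varphi_\ell(x_i^{\mathrm{tr}}) \right)^2 = \bar a^\top \hat H \bar a,$$

for any $a \in \mathbb{R}^b$. Therefore, we obtain

$$\max_{\|a\|_2 = 1} a^\top \hat H a \le \max_{\|a\|_2 = 1} \bar a^\top \hat H \bar a = \max_{\|a\|_2 = 1,\ a \ge 0_b} a^\top \hat H a, \qquad (65)$$

where the last equality is derived from the relation

$$\{\bar a \mid \|a\|_2 = 1,\ a \in \mathbb{R}^b\} = \{a \mid \|a\|_2 = 1,\ a \ge 0_b,\ a \in \mathbb{R}^b\}.$$

On the other hand, due to the additional constraint $a \ge 0_b$, the inequality

$$\max_{\|a\|_2 = 1,\ a \ge 0_b} a^\top \hat H a \le \max_{\|a\|_2 = 1} a^\top \hat H a \qquad (66)$$

holds. From Eqs. (65) and (66), we have

$$\kappa_{\max} = \max_{\|a\|_2 = 1} a^\top \hat H a = \max_{\|a\|_2 = 1,\ a \ge 0_b} a^\top \hat H a = \max_{\|a\|_2 = 1,\ a \ge 0_b} \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \sum_{\ell=1}^{b} a_\ell\,\varphi_\ell(x_i^{\mathrm{tr}}) \right)^2.$$

Using the assumption $0 < \varphi_\ell(x) \le 1$, we have

$$\kappa_{\max} = \max_{\|a\|_2 = 1,\ a \ge 0_b} \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \sum_{\ell=1}^{b} a_\ell\,\varphi_\ell(x_i^{\mathrm{tr}}) \right)^2 \le \max_{\|a\|_2 = 1,\ a \ge 0_b} \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \sum_{\ell=1}^{b} a_\ell \right)^2 = \max_{\|a\|_2 = 1,\ a \ge 0_b} \left( \sum_{\ell=1}^{b} a_\ell \right)^2 \le \max_{\|a\|_2 = 1,\ a \ge 0_b} b \sum_{\ell=1}^{b} a_\ell^2 = b, \qquad (67)$$

where the Schwarz inequality for $a$ and $1_b$ is used in the last inequality. The inequalities (63) and
(67) lead to the inequality (29).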
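
The bound $\kappa_{\max} \le b$ is easy to confirm numerically. The following sketch assumes Gaussian basis functions with hypothetical centers and width (the helper name `largest_eigenvalue` and all data are illustrative only), so that every basis value lies in (0, 1].

```python
import numpy as np

def largest_eigenvalue(Phi):
    """Largest eigenvalue of H = (1/n) * Phi^T Phi for an (n, b) matrix of basis values."""
    H = Phi.T @ Phi / Phi.shape[0]
    return float(np.linalg.eigvalsh(H)[-1])   # eigvalsh returns eigenvalues in ascending order

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))             # synthetic training inputs
centers = X[:10]                              # b = 10 Gaussian centers (illustrative choice)
sigma = 1.0                                   # Gaussian width (illustrative choice)

sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-sq_dist / (2.0 * sigma ** 2))   # basis values, all in (0, 1]
kappa_max = largest_eigenvalue(Phi)
```

The value of `kappa_max` stays below `b = 10` regardless of the width, and it approaches `b` as the width grows and all basis values approach one.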

It is clear that the upper bound (29) is a decreasing function of $\lambda$ (> 0). For the Gaussian basis
function, $\varphi_\ell(x)$ is an increasing function with respect to the Gaussian width $\sigma$. Thus, Eq. (29) is a
decreasing function of $\sigma$.

Appendix H. ‘Bridge’ Upper Bound of Approximation Error for uLSIF

Here  we  prove  Theorem  7. 

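From the triangle inequality, we obtain

$$\mathrm{diff}(\lambda) \le \frac{\inf_{\lambda' \ge 0} \|\hat\alpha(\lambda') - \hat\gamma(\lambda)\|_{\hat H} + \|\hat\gamma(\lambda) - \hat\beta(\lambda)\|_{\hat H}}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))}. \qquad (68)$$

We derive an upper bound of the first term.

First, we show that the LSIF optimization problem (6) is equivalently expressed as

$$\min_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top \hat H \alpha - \hat h^\top \alpha \right] \quad \text{subject to } \alpha \ge 0_b,\ 1_b^\top \alpha \le c,$$

which we refer to as LSIF'. The KKT conditions of LSIF (6) are given as

$$\hat H \alpha - \hat h + \lambda 1_b - \mu = 0_b, \qquad \alpha \ge 0_b,\ \mu \ge 0_b,\ \alpha^\top \mu = 0,$$

where $\mu$ is the Lagrange multiplier vector. Similarly, the KKT conditions of LSIF' are given as

$$\hat H \alpha - \hat h + \mu_0 1_b - \mu = 0_b,$$
$$\alpha \ge 0_b,\ \mu \ge 0_b,\ \alpha^\top \mu = 0, \qquad (69)$$
$$1_b^\top \alpha - c \le 0,\ \mu_0 \ge 0,\ (1_b^\top \alpha - c)\mu_0 = 0,$$

where $\mu$ and $\mu_0$ are the Lagrange multipliers. Let $(\hat\alpha(\lambda), \hat\mu(\lambda))$ be the solution of the KKT conditions
of LSIF. Then, we find that $(\alpha, \mu, \mu_0) = (\hat\alpha(\lambda), \hat\mu(\lambda), \lambda)$ is the solution of Eq. (69) with $c = 1_b^\top \hat\alpha(\lambda)$.
Note that LSIF' is a strictly convex optimization problem, and thus $\hat\alpha(\lambda)$ is the unique optimal
solution. Conversely, when the solution of Eq. (69) is provided as $(\alpha, \mu, \mu_0)$, LSIF with $\lambda = \mu_0$ has
the same optimal solution $\alpha$.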

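When the optimal solution of LSIFq is $\hat\gamma(\lambda)$, the KKT conditions of LSIFq (30) are given as

$$\hat H \hat\gamma(\lambda) - \hat h + \lambda \hat\gamma(\lambda) - \eta = 0_b, \qquad (70)$$
$$\hat\gamma(\lambda) \ge 0_b,\ \eta \ge 0_b,\ \hat\gamma(\lambda)^\top \eta = 0, \qquad (71)$$

where $\eta$ is the Lagrange multiplier vector.

Let $\hat\alpha(\lambda_1)$ be the optimal solution of LSIF' with $c = 1_b^\top \hat\gamma(\lambda)$, and suppose that the solution $\hat\alpha(\lambda_1)$
coincides with that of LSIF with $\lambda = \lambda_1$. Then, from Eq. (69), we have

$$\hat H \hat\alpha(\lambda_1) - \hat h + \lambda_1 1_b - \hat\mu(\lambda_1) = 0_b, \qquad (72)$$
$$\hat\alpha(\lambda_1) \ge 0_b,\ \hat\mu(\lambda_1) \ge 0_b,\ \hat\alpha(\lambda_1)^\top \hat\mu(\lambda_1) = 0, \qquad (73)$$
$$1_b^\top \hat\alpha(\lambda_1) - 1_b^\top \hat\gamma(\lambda) \le 0,\ \lambda_1 \ge 0,\ (1_b^\top \hat\alpha(\lambda_1) - 1_b^\top \hat\gamma(\lambda))\lambda_1 = 0. \qquad (74)$$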

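From Eqs. (70) and (72), we obtain

$$\hat H(\hat\alpha(\lambda_1) - \hat\gamma(\lambda)) = -\lambda_1 1_b + \lambda \hat\gamma(\lambda) + \hat\mu(\lambda_1) - \eta. \qquad (75)$$

Applying Eqs. (71), (73), (74), and (75), we have

$$\inf_{\lambda' \ge 0} \|\hat\alpha(\lambda') - \hat\gamma(\lambda)\|_{\hat H}^2 \le (\hat\alpha(\lambda_1) - \hat\gamma(\lambda))^\top \hat H (\hat\alpha(\lambda_1) - \hat\gamma(\lambda))$$
$$= -\lambda_1 (\hat\alpha(\lambda_1) - \hat\gamma(\lambda))^\top 1_b + \lambda (\hat\alpha(\lambda_1) - \hat\gamma(\lambda))^\top \hat\gamma(\lambda) + (\hat\alpha(\lambda_1) - \hat\gamma(\lambda))^\top (\hat\mu(\lambda_1) - \eta)$$
$$= \lambda \left( \hat\alpha(\lambda_1)^\top \hat\gamma(\lambda) - \|\hat\gamma(\lambda)\|_2^2 \right) - \hat\alpha(\lambda_1)^\top \eta - \hat\gamma(\lambda)^\top \hat\mu(\lambda_1)$$
$$\le \lambda \left( \hat\alpha(\lambda_1)^\top \hat\gamma(\lambda) - \|\hat\gamma(\lambda)\|_2^2 \right), \qquad (76)$$

where the last step holds because $\hat\alpha(\lambda_1)$, $\hat\gamma(\lambda)$, $\hat\mu(\lambda_1)$, and $\eta$ are all non-negative.

From $\hat\alpha(\lambda_1) \ge 0_b$, $\hat\gamma(\lambda) \ge 0_b$, and $1_b^\top \hat\alpha(\lambda_1) \le 1_b^\top \hat\gamma(\lambda)$, we have

$$\|\hat\alpha(\lambda_1)\|_1 = 1_b^\top \hat\alpha(\lambda_1) \le 1_b^\top \hat\gamma(\lambda) \le \|\hat\gamma(\lambda)\|_1.$$

Then we have the following inequality:

$$\hat\alpha(\lambda_1)^\top \hat\gamma(\lambda) \le \hat\alpha(\lambda_1)^\top (\|\hat\gamma(\lambda)\|_\infty 1_b) = \|\hat\alpha(\lambda_1)\|_1 \cdot \|\hat\gamma(\lambda)\|_\infty \le \|\hat\gamma(\lambda)\|_1 \cdot \|\hat\gamma(\lambda)\|_\infty. \qquad (77)$$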

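For $p$ and $q$ such that $1/p + 1/q = 1$ and $1 \le p, q \le \infty$, Hölder’s inequality states that

$$\|\alpha * \beta\|_1 \le \|\alpha\|_p \cdot \|\beta\|_q,$$

where $\alpha * \beta$ denotes the element-wise product of $\alpha$ and $\beta$. Setting $p = 1$, $q = \infty$, and $\alpha = \beta = \hat\gamma(\lambda)$
in Hölder’s inequality, we have

$$\|\hat\gamma(\lambda)\|_1 \cdot \|\hat\gamma(\lambda)\|_\infty - \|\hat\gamma(\lambda)\|_2^2 \ge 0. \qquad (78)$$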
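
The quantity in Eq. (78) is what controls the first term of the bridge bound, so it helps to see when it vanishes. The sketch below uses a hypothetical helper `bridge_gap` to evaluate it for a few illustrative vectors.

```python
import numpy as np

def bridge_gap(gamma):
    """Return ||gamma||_1 * ||gamma||_inf - ||gamma||_2^2, which Hoelder's inequality makes non-negative."""
    gamma = np.asarray(gamma, dtype=float)
    return (np.linalg.norm(gamma, 1) * np.linalg.norm(gamma, np.inf)
            - np.linalg.norm(gamma, 2) ** 2)

rng = np.random.default_rng(0)
random_gaps = [bridge_gap(rng.uniform(0.0, 1.0, size=8)) for _ in range(1000)]

gap_one_hot = bridge_gap([3.0, 0.0, 0.0])     # a single active coefficient
gap_flat = bridge_gap([2.0, 2.0, 0.0])        # equal values on the active coefficients
gap_uneven = bridge_gap([1.0, 2.0, 3.0])      # unequal active coefficients
```

The gap is zero exactly when all non-zero entries share the same magnitude, and grows as the active coefficients become more uneven.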

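Combining Eqs. (68), (76), (77), and (78), we obtain

$$\mathrm{diff}(\lambda) \le \frac{\sqrt{\lambda \left( \|\hat\gamma(\lambda)\|_1 \cdot \|\hat\gamma(\lambda)\|_\infty - \|\hat\gamma(\lambda)\|_2^2 \right)} + \|\hat\gamma(\lambda) - \hat\beta(\lambda)\|_{\hat H}}{\sum_{i=1}^{n_{\mathrm{tr}}} \hat w(x_i^{\mathrm{tr}}; \hat\beta(\lambda))}.$$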

Appendix  I.  Closed  Form  of  LOOCV  Score  for  uLSIF 

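Here we derive a closed-form expression of the LOOCV score for uLSIF (see Figure 2 for the
pseudo code).

Let

$$\varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_b(x))^\top.$$

Then the matrix $\hat H$ and the vector $\hat h$ are expressed as

$$\hat H = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top,$$
$$\hat h = \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \varphi(x_j^{\mathrm{te}}),$$

and the coefficients $\tilde\beta(\lambda)$ can be computed by

$$\tilde\beta(\lambda) = (\hat H + \lambda I_b)^{-1} \hat h.$$

Let $\hat\beta^{(i)}(\lambda)$ be the estimator obtained without the $i$-th training sample $x_i^{\mathrm{tr}}$ and the $i$-th test sample $x_i^{\mathrm{te}}$.
Then the estimator has the following closed form:

$$\hat\beta^{(i)}(\lambda) = \max(0_b, \tilde\beta^{(i)}(\lambda)),$$
$$\tilde\beta^{(i)}(\lambda) = \left( \frac{1}{n_{\mathrm{tr}} - 1} \left( n_{\mathrm{tr}} \hat H - \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top \right) + \lambda I_b \right)^{-1} \frac{1}{n_{\mathrm{te}} - 1} \left( n_{\mathrm{te}} \hat h - \varphi(x_i^{\mathrm{te}}) \right).$$

Let $\hat B = \hat H + \frac{\lambda (n_{\mathrm{tr}} - 1)}{n_{\mathrm{tr}}} I_b$ and $\beta = \hat B^{-1} \hat h$ in the following calculation. Using the Sherman-Woodbury-Morrison
formula (33), we can simplify the expression of $\tilde\beta^{(i)}(\lambda)$ as follows:

$$\tilde\beta^{(i)}(\lambda) = \frac{n_{\mathrm{tr}} - 1}{n_{\mathrm{tr}}} \left( \hat B - \frac{1}{n_{\mathrm{tr}}} \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top \right)^{-1} \left( \frac{n_{\mathrm{te}}}{n_{\mathrm{te}} - 1} \hat h - \frac{1}{n_{\mathrm{te}} - 1} \varphi(x_i^{\mathrm{te}}) \right)$$
$$= \frac{n_{\mathrm{tr}} - 1}{n_{\mathrm{tr}}} \left( \hat B^{-1} + \frac{\hat B^{-1} \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top \hat B^{-1}}{n_{\mathrm{tr}} - \varphi(x_i^{\mathrm{tr}})^\top \hat B^{-1} \varphi(x_i^{\mathrm{tr}})} \right) \left( \frac{n_{\mathrm{te}}}{n_{\mathrm{te}} - 1} \hat h - \frac{1}{n_{\mathrm{te}} - 1} \varphi(x_i^{\mathrm{te}}) \right)$$
$$= \frac{(n_{\mathrm{tr}} - 1) n_{\mathrm{te}}}{n_{\mathrm{tr}} (n_{\mathrm{te}} - 1)} \left( \beta + \frac{\hat B^{-1} \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top \beta}{n_{\mathrm{tr}} - \varphi(x_i^{\mathrm{tr}})^\top \hat B^{-1} \varphi(x_i^{\mathrm{tr}})} \right) - \frac{n_{\mathrm{tr}} - 1}{n_{\mathrm{tr}} (n_{\mathrm{te}} - 1)} \left( \hat B^{-1} \varphi(x_i^{\mathrm{te}}) + \frac{\hat B^{-1} \varphi(x_i^{\mathrm{tr}}) \varphi(x_i^{\mathrm{tr}})^\top \hat B^{-1} \varphi(x_i^{\mathrm{te}})}{n_{\mathrm{tr}} - \varphi(x_i^{\mathrm{tr}})^\top \hat B^{-1} \varphi(x_i^{\mathrm{tr}})} \right).$$

Thus the matrix inversion required for computing $\tilde\beta^{(i)}(\lambda)$ for all $i = 1, 2, \ldots, n_{\mathrm{tr}}$ is only $\hat B^{-1}$. Applying
this to Eq. (32) and rearranging the formula, we can compute the LOOCV score analytically.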
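
The rank-one update behind this closed form can be compared against a direct solve. The following is a minimal sketch, assuming random basis values (the arrays `Phi_tr`, `Phi_te` and both function names are hypothetical); it is not the pseudo code of Figure 2, only a check that the two routes give the same leave-one-out coefficients before rounding.

```python
import numpy as np

def loo_direct(Phi_tr, Phi_te, lam, i):
    """Leave-one-out coefficients obtained by rebuilding the system without sample i."""
    n_tr, b = Phi_tr.shape
    n_te = Phi_te.shape[0]
    H_i = (Phi_tr.T @ Phi_tr - np.outer(Phi_tr[i], Phi_tr[i])) / (n_tr - 1)
    h_i = (Phi_te.sum(axis=0) - Phi_te[i]) / (n_te - 1)
    return np.linalg.solve(H_i + lam * np.eye(b), h_i)

def loo_closed_form(Phi_tr, Phi_te, lam, i):
    """Same coefficients via a Sherman-Morrison rank-one update of a single inverse."""
    n_tr, b = Phi_tr.shape
    n_te = Phi_te.shape[0]
    H = Phi_tr.T @ Phi_tr / n_tr
    h = Phi_te.mean(axis=0)
    # In practice this inverse is computed once and reused for every i.
    B_inv = np.linalg.inv(H + lam * (n_tr - 1) / n_tr * np.eye(b))
    f_tr, f_te = Phi_tr[i], Phi_te[i]
    B_inv_f_tr = B_inv @ f_tr
    B_inv_f_te = B_inv @ f_te
    beta = B_inv @ h
    denom = n_tr - f_tr @ B_inv_f_tr
    part_h = beta + B_inv_f_tr * (f_tr @ beta) / denom
    part_te = B_inv_f_te + B_inv_f_tr * (f_tr @ B_inv_f_te) / denom
    scale = (n_tr - 1) / (n_tr * (n_te - 1))
    return scale * (n_te * part_h - part_te)

rng = np.random.default_rng(0)
Phi_tr = rng.uniform(0.05, 1.0, size=(30, 5))
Phi_te = rng.uniform(0.05, 1.0, size=(40, 5))
pairs = [(loo_direct(Phi_tr, Phi_te, 0.5, i), loo_closed_form(Phi_tr, Phi_te, 0.5, i))
         for i in (0, 3, 7)]
```

Applying `np.maximum(..., 0.0)` to either result gives the rounded leave-one-out estimator; the saving comes from replacing one linear solve per held-out sample with matrix-vector products.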
Abstract

Methods for directly estimating the ratio of two probability density functions have
been actively explored recently since they can be used for various data processing
tasks such as non-stationarity adaptation, outlier detection, and feature selection. In
this paper, we develop a new method which incorporates dimensionality reduction
into a direct density-ratio estimation procedure. Our key idea is to find a
low-dimensional subspace in which densities are significantly different and perform
density ratio estimation only in this subspace. The proposed method, D3-LHSS (Direct
Density-ratio estimation with Dimensionality reduction via Least-squares
Hetero-distributional Subspace Search), is shown to overcome the limitation of baseline
methods.


Keywords 

density ratio estimation, dimensionality reduction, unconstrained least-squares
importance fitting


1  Introduction 

Recently, it has been demonstrated that various machine learning and data mining tasks
can be formulated in terms of the ratio of two probability density functions (Sugiyama
et al., 2009; Sugiyama et al., 2011). Examples of such tasks include covariate shift
adaptation (Shimodaira, 2000; Zadrozny, 2004; Sugiyama et al., 2007; Sugiyama & Kawanabe,
2010), transfer learning (Storkey & Sugiyama, 2007), multi-task learning (Bickel et al.,
2008), outlier detection (Hido et al., 2008; Smola et al., 2009; Hido et al., 2010),
conditional density estimation (Sugiyama et al., 2010c), probabilistic classification (Sugiyama,
2010), variable selection (Suzuki et al., 2009a), independent component analysis (Suzuki
& Sugiyama, 2009), supervised dimensionality reduction (Suzuki & Sugiyama, 2010), and
causal inference (Yamada & Sugiyama, 2010). For this reason, estimating the density
ratio has been attracting a great deal of attention, and various approaches have been
explored (Silverman, 1978; Cwik & Mielniczuk, 1989; Gijbels & Mielniczuk, 1995; Sun &
Woodroofe, 1997; Jacob & Oliveira, 1997; Qin, 1998; Cheng & Chu, 2004; Huang et al.,
2007; Bensaid & Fabre, 2007; Bickel et al., 2007; Sugiyama et al., 2008; Kanamori et al.,
2009a; Chen et al., 2009; Sugiyama et al., 2010b; Nguyen et al., 2010).

A naive approach to density ratio estimation is to approximate the two densities in
the ratio (i.e., the numerator and the denominator) separately using a flexible technique
such as non-parametric kernel density estimation (Silverman, 1986; Härdle et al., 2004),
and then take the ratio of the estimated densities. However, this naive two-step approach
is not reliable in practical situations since kernel density estimation performs poorly in
high-dimensional cases; furthermore, division by an estimated density tends to magnify
the estimation error. To improve the estimation accuracy, various methods have been
developed for directly estimating the density ratio without going through density
estimation, e.g., the moment matching method using reproducing kernels (Aronszajn, 1950;
Steinwart, 2001) called kernel mean matching (KMM) (Huang et al., 2007; Quiñonero-Candela
et al., 2009), the method based on logistic regression (LR) (Qin, 1998; Cheng
& Chu, 2004; Bickel et al., 2007), the distribution matching method under the
Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) called the KL importance estimation
procedure (KLIEP) (Sugiyama et al., 2008; Nguyen et al., 2010), and the density-ratio
matching methods under the squared loss called least-squares importance fitting (LSIF)
and unconstrained LSIF (uLSIF) (Kanamori et al., 2009a). These methods have been
shown to compare favorably with naive kernel density estimation through extensive
experiments.

[Figure 1 diagram: knowing the two densities $p_{\mathrm{nu}}(x)$, $p_{\mathrm{de}}(x)$ gives the ratio $r(x) = p_{\mathrm{nu}}(x)/p_{\mathrm{de}}(x)$.]

Figure 1: Density ratio estimation is substantially easier than density estimation. The
density ratio $r(x)$ can be computed if two densities $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$ are known. However,
even if the density ratio is known, the two densities cannot be computed in general.

The success of these direct density-ratio estimation methods could be intuitively
understood through Vapnik’s principle (Vapnik, 1998): “When solving a problem of interest,
one should not solve a more general problem as an intermediate step”. The support vector
machine would be a successful example following this principle: instead of estimating
the data generation model, it directly models the decision boundary, which is simpler and
sufficient for pattern recognition. In the current context, estimating the densities is more
general than estimating the density ratio since knowing the two densities implies knowing
the ratio, but not vice versa (Figure 1). Thus directly estimating the density ratio would
be more promising than density ratio estimation via density estimation.

However, density ratio estimation in high-dimensional cases is still challenging even
when the ratio is estimated directly without going through density estimation. Recently,
an approach called Direct Density-ratio estimation with Dimensionality reduction (D3)
has been proposed (Sugiyama et al., 2010a). The basic idea of D3 is the following two-step
procedure: first a subspace in which the numerator and denominator densities are
significantly different (called the hetero-distributional subspace) is identified, and then
density ratio estimation is performed in this subspace. The rationale behind this approach
is that, in practice, the distribution change does not occur in the entire space, but is
often confined to a subspace. For example, in non-stationarity adaptation scenarios, the
distribution change often occurs only for some attributes and other variables are stable; in
outlier detection scenarios, only a small number of attributes would cause a data sample
to be an outlier.

In the D3 algorithm, the hetero-distributional subspace is identified by searching a
subspace in which samples drawn from the two distributions (i.e., the numerator and the
denominator of the ratio) are separated from each other; this search is carried out in
a computationally efficient manner using a supervised dimensionality reduction method
called local Fisher discriminant analysis (LFDA) (Sugiyama, 2007). Then, within the
identified hetero-distributional subspace, a direct density-ratio estimation method called
unconstrained least-squares importance fitting (uLSIF), which was shown to be
computationally efficient (Kanamori et al., 2009a) and numerically stable (Kanamori et al.,
2009b), is employed for obtaining the final density-ratio estimator. Through
experiments, this D3 procedure (which we refer to as D3-LFDA/uLSIF) was shown to improve
the performance in high-dimensional cases.

Although the framework of D3 is promising, the above D3-LFDA/uLSIF method
possesses two fundamental weaknesses: the restrictive definition of the hetero-distributional
subspace and the limited ability of its search method. More specifically, the component
inside the hetero-distributional subspace and its complementary component are assumed
to be statistically independent in the original formulation (Sugiyama et al., 2010a).
However, this assumption is rather restrictive and may not be fulfilled in practice. Also, in
the above D3 procedure, the hetero-distributional subspace is identified by searching a
subspace in which samples drawn from the numerator and denominator distributions are
separated from each other. If samples from the two distributions are separable, the two
distributions would be significantly different. However, the opposite may not always be
true, i.e., non-separability does not necessarily imply that the two distributions are
equivalent (consider two similar but distinct distributions with a common support). Thus LFDA
(and any other supervised dimensionality reduction method) does not necessarily identify the
correct hetero-distributional subspace.

The goal of this paper is to give a new procedure of D3 that can overcome the above
weaknesses. First, we adopt a more general definition of the hetero-distributional
subspace. More precisely, we remove the independence assumption between the component
inside the hetero-distributional subspace and its complementary component. This allows
us to apply the concept of D3 to a wider class of problems. However, this general
definition in turn makes the problem of searching the hetero-distributional subspace more
challenging: supervised dimensionality reduction methods for separating samples drawn
from the two distributions cannot be used anymore, and we need an alternative method
that identifies the largest subspace such that the two conditional distributions are
equivalent in its complementary subspace.

We prove that the hetero-distributional subspace can be identified by finding a
subspace in which two marginal distributions are maximally different under the Pearson
divergence, which is a squared-loss variant of the Kullback-Leibler divergence and is an
instance of the f-divergences (Ali & Silvey, 1966; Csiszár, 1967). Then we propose a
new method, which we call Least-squares Hetero-distributional Subspace Search (LHSS),
for searching a subspace such that the Pearson divergence between two marginal
distributions is maximized. An advantage of the LHSS method is that the subspace search
(divergence estimation within a subspace) is carried out also using the density-ratio
estimation method uLSIF. Thus the two steps in the D3 procedure (first identifying the
hetero-distributional subspace and then estimating the density ratio within the subspace)
are merged into a single step. Thanks to this, the final density-ratio estimator can be
automatically obtained without additional computation. We call the combined single-shot
density-ratio estimation procedure D3 via LHSS (D3-LHSS). Through experiments, we
show that the weaknesses of the existing approach can be successfully overcome by the
D3-LHSS approach.

[Figure 2 diagram, recoverable panel: Proposed approach: direct density-ratio estimation with
dimensionality reduction (D3-LHSS; the above two steps are merged into a single process),
simultaneously identifying the hetero-distributional subspace and directly estimating the
ratio in the subspace by uLSIF.]

Figure 2: Existing and proposed density-ratio estimation approaches.

The relation among the existing and proposed density-ratio estimation methods is
summarized in Figure 2.

2  Formulation  of  Density-ratio  Estimation  Problem 

In this section, we formulate the problem of density ratio estimation and review a relevant
density-ratio estimation method. We briefly summarize possible usage of density ratios
in various data processing tasks in Appendix A.

2.1 Problem Formulation

Let $\mathcal{D}$ ($\subset \mathbb{R}^d$) be the data domain and suppose we are given independent and
identically distributed (i.i.d.) samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ from a distribution with density $p_{\mathrm{nu}}(x)$ and
i.i.d. samples $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$ from another distribution with density $p_{\mathrm{de}}(x)$. We assume that
the latter density $p_{\mathrm{de}}(x)$ is strictly positive, i.e.,

$$p_{\mathrm{de}}(x) > 0 \ \text{for all } x \in \mathcal{D}.$$

The problem we address in this paper is to estimate the density ratio

$$r(x) := \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}$$

from samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$. The subscripts ‘nu’ and ‘de’ denote ‘numerator’ and
‘denominator’, respectively.


2.2  Directly  Estimating  Density  Ratios  by  Unconstrained 
Least-squares  Importance  Fitting  (uLSIF) 

As described in Appendix A, density ratios are useful in various data processing tasks.
Since the density ratio is usually unknown and needs to be estimated from data, methods
of estimating the density ratio have been actively explored recently (Qin, 1998; Cheng &
Chu, 2004; Huang et al., 2007; Bickel et al., 2007; Sugiyama et al., 2008; Kanamori et al.,
2009a). Here, we briefly review a direct density-ratio estimation method called
unconstrained least-squares importance fitting (uLSIF) proposed by Kanamori et al. (2009a).
For convenience in later sections, we replace the symbol $x$ with $u$, i.e., let us consider the
problem of estimating the density ratio

$$r(u) := \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)}$$

from the i.i.d. samples $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{u_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$.

2.2.1  Linear  Least-squares  Estimation  of  Density  Ratios 

Let us model the density ratio $r(u)$ by the following linear model:

$$\hat r(u) := \sum_{\ell=1}^{b} \alpha_\ell \psi_\ell(u),$$

where

$$\alpha := (\alpha_1, \alpha_2, \ldots, \alpha_b)^\top$$

are parameters to be learned from data samples, $b$ denotes the number of parameters, $^\top$
denotes the transpose of a matrix or a vector, and $\{\psi_\ell(u)\}_{\ell=1}^{b}$ are basis functions such
that

$$\psi_\ell(u) \ge 0 \ \text{for all } u \text{ and for } \ell = 1, 2, \ldots, b.$$

Note that $b$ and $\{\psi_\ell(u)\}_{\ell=1}^{b}$ could be dependent on the samples $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{u_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$,
meaning that kernel models are also allowed. We explain how the basis functions $\{\psi_\ell(u)\}_{\ell=1}^{b}$
are designed in Section 2.2.2.

The parameters $\{\alpha_\ell\}_{\ell=1}^{b}$ in the model $\hat r(u)$ are determined so that the following squared
error $J_0$ is minimized:

$$J_0(\alpha) := \frac{1}{2} \int \left( \hat r(u) - r(u) \right)^2 p_{\mathrm{de}}(u)\,du = \frac{1}{2} \int \hat r(u)^2 p_{\mathrm{de}}(u)\,du - \int \hat r(u) p_{\mathrm{nu}}(u)\,du + \frac{1}{2} \int r(u) p_{\mathrm{nu}}(u)\,du,$$

where the last term is a constant and therefore can be safely ignored. Let us denote the
first two terms by $J$:

$$J(\alpha) := \frac{1}{2} \int \hat r(u)^2 p_{\mathrm{de}}(u)\,du - \int \hat r(u) p_{\mathrm{nu}}(u)\,du. \qquad (1)$$

Note that the same objective function can be obtained via the Legendre-Fenchel duality
of a divergence (Nguyen et al., 2010).

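Approximating the expectations in $J$ by empirical averages, we obtain

$$\hat J(\alpha) := \frac{1}{2 n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \hat r(u_j^{\mathrm{de}})^2 - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \hat r(u_i^{\mathrm{nu}}) = \frac{1}{2} \alpha^\top \hat H \alpha - \hat h^\top \alpha,$$

where $\hat H$ is the $b \times b$ matrix with the $(\ell, \ell')$-th element

$$\hat H_{\ell,\ell'} := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \psi_\ell(u_j^{\mathrm{de}}) \psi_{\ell'}(u_j^{\mathrm{de}}),$$

and $\hat h$ is the $b$-dimensional vector with the $\ell$-th element

$$\hat h_\ell := \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \psi_\ell(u_i^{\mathrm{nu}}).$$

Now the optimization problem is formulated as follows.

$$\hat\alpha := \operatorname*{argmin}_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top \hat H \alpha - \hat h^\top \alpha + \frac{\lambda}{2} \alpha^\top \alpha \right],$$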


(2) 


(3) 


(4) 
where a penalty term $\lambda \alpha^\top \alpha / 2$ is included for regularization purposes, and $\lambda$ ($\ge 0$) is a
regularization parameter that controls the strength of regularization. It is easy to confirm
that the solution $\hat\alpha$ can be analytically computed as

$$\hat\alpha = (\hat H + \lambda I_b)^{-1} \hat h, \qquad (5)$$

where $I_b$ is the $b$-dimensional identity matrix. Thanks to this analytic-form expression,
uLSIF is computationally efficient compared with other density-ratio estimators which
involve non-linear optimization (Qin, 1998; Cheng & Chu, 2004; Huang et al., 2007;
Bickel et al., 2007; Sugiyama et al., 2008; Nguyen et al., 2010).
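
The analytic solution amounts to a single regularized linear solve. Below is a minimal sketch, assuming the basis functions have already been evaluated into design matrices `Psi_nu` and `Psi_de` (hypothetical names, filled here with made-up non-negative values); it is an illustration of the formula, not the authors' implementation.

```python
import numpy as np

def ulsif_solve(Psi_nu, Psi_de, lam):
    """Closed-form coefficients for the regularized least-squares density-ratio fit.

    Psi_nu: (n_nu, b) basis values at numerator samples.
    Psi_de: (n_de, b) basis values at denominator samples.
    Returns (alpha_hat, H_hat, h_hat).
    """
    n_de, b = Psi_de.shape
    H_hat = Psi_de.T @ Psi_de / n_de       # empirical second moments under the denominator
    h_hat = Psi_nu.mean(axis=0)            # empirical means under the numerator
    alpha_hat = np.linalg.solve(H_hat + lam * np.eye(b), h_hat)
    return alpha_hat, H_hat, h_hat

def objective(alpha, H_hat, h_hat, lam):
    """Regularized empirical objective minimized by alpha_hat."""
    return 0.5 * alpha @ H_hat @ alpha - h_hat @ alpha + 0.5 * lam * alpha @ alpha

rng = np.random.default_rng(0)
lam = 0.1
Psi_nu = rng.uniform(0.0, 1.0, size=(50, 4))   # made-up basis values
Psi_de = rng.uniform(0.0, 1.0, size=(60, 4))
alpha_hat, H_hat, h_hat = ulsif_solve(Psi_nu, Psi_de, lam)
best_value = objective(alpha_hat, H_hat, h_hat, lam)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; both compute the same vector.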

In the original uLSIF paper (Kanamori et al., 2009a), the above solution is further
modified as

$$\hat\alpha_\ell \longleftarrow \max(0, \hat\alpha_\ell).$$

This modification may improve the estimation accuracy in finite sample cases since the
true density ratio is non-negative. Even so, we still use Eq. (5) as it is since it is
differentiable with respect to $U$, where $u = Ux$. This differentiability will play a crucial role in
the next section. Note that, even without the above round-up modification, the solution
is guaranteed to converge to the optimal vector asymptotically both in parametric and
non-parametric cases (Kanamori et al., 2009a; Kanamori et al., 2009b). Thus omitting
the above modification step may not have a strong effect.

It was shown that uLSIF possesses superior theoretical properties in
statistical convergence and numerical stability (Kanamori et al., 2009a; Kanamori et al.,
2009b).

2.2.2  Basis  Function  Design 

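The performance of uLSIF depends on the choice of the basis functions $\{\psi_\ell(u)\}_{\ell=1}^{b}$. As
explained below, the use of Gaussian basis functions would be reasonable:

$$\hat r(u) = \sum_{\ell=1}^{n_{\mathrm{nu}}} \alpha_\ell K(u, u_\ell^{\mathrm{nu}}),$$

where $K(u, u')$ is the Gaussian kernel with kernel width $\sigma$ (> 0):

$$K(u, u') = \exp\left( -\frac{\|u - u'\|^2}{2\sigma^2} \right).$$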

By definition, the density ratio $r(u)$ tends to take large values if $p_{\mathrm{nu}}(u)$ is large and
$p_{\mathrm{de}}(u)$ is small; conversely, $r(u)$ tends to be small (i.e., close to zero) if $p_{\mathrm{nu}}(u)$ is small
and $p_{\mathrm{de}}(u)$ is large. When a non-negative function is approximated by a Gaussian kernel
model, many kernels may be needed in the region where the output of the target function
is large; on the other hand, only a small number of kernels would be enough in the region
where the output of the target function is close to zero (see Figure 3). Following this
heuristic, we allocate many kernels in the region where $p_{\mathrm{nu}}(u)$ takes large values, which
may be approximately achieved by setting the Gaussian centers at $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$.

Figure 3: Heuristic of Gaussian kernel allocation.

Alternatively, we may locate $(n_{\mathrm{nu}} + n_{\mathrm{de}})$ Gaussian kernels at both $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and
$\{u_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$. However, in our preliminary experiments, this did not further improve the
performance, but slightly increased the computational cost. When $n_{\mathrm{nu}}$ is very large, just
using all the numerator samples $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ as Gaussian centers is already computationally
rather demanding. To ease this problem, a subset of $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ may be used as Gaussian
centers for computational efficiency, i.e., for a prefixed $b$ ($\in \{1, 2, \ldots, n_{\mathrm{nu}}\}$), we use

$$\hat r(u) = \sum_{\ell=1}^{b} \alpha_\ell K(u, c_\ell),$$

where $\{c_\ell\}_{\ell=1}^{b}$ are template points randomly chosen from $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ without replacement.

The performance of uLSIF depends on the kernel width $\sigma$ and the regularization
parameter $\lambda$. Model selection of uLSIF is possible based on cross-validation (CV) with
respect to the error criterion (1) (Kanamori et al., 2009a).
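
The basis design and model selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: the samples are synthetic one-dimensional Gaussians, the candidate grids are arbitrary, and a single hold-out split stands in for full cross-validation; all function names are hypothetical.

```python
import numpy as np

def gaussian_design(U, centers, sigma):
    """Matrix of Gaussian kernel values K(u, c) for every sample u and center c."""
    sq_dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def choose_centers(U_nu, b, rng):
    """Pick b template points from the numerator samples without replacement."""
    idx = rng.choice(U_nu.shape[0], size=min(b, U_nu.shape[0]), replace=False)
    return U_nu[idx]

def fit_alpha(U_nu, U_de, centers, sigma, lam):
    """Regularized least-squares coefficients for the given kernel width and lambda."""
    Psi_de = gaussian_design(U_de, centers, sigma)
    Psi_nu = gaussian_design(U_nu, centers, sigma)
    H_hat = Psi_de.T @ Psi_de / U_de.shape[0]
    h_hat = Psi_nu.mean(axis=0)
    return np.linalg.solve(H_hat + lam * np.eye(centers.shape[0]), h_hat)

def holdout_score(alpha, U_nu, U_de, centers, sigma):
    """Empirical version of criterion (1) on held-out samples (smaller is better)."""
    r_de = gaussian_design(U_de, centers, sigma) @ alpha
    r_nu = gaussian_design(U_nu, centers, sigma) @ alpha
    return 0.5 * np.mean(r_de ** 2) - np.mean(r_nu)

def select_model(U_nu, U_de, centers, sigmas, lams):
    """Grid search over (sigma, lambda) using one hold-out split instead of full CV."""
    half_nu, half_de = U_nu.shape[0] // 2, U_de.shape[0] // 2
    best = None
    for sigma in sigmas:
        for lam in lams:
            alpha = fit_alpha(U_nu[:half_nu], U_de[:half_de], centers, sigma, lam)
            score = holdout_score(alpha, U_nu[half_nu:], U_de[half_de:], centers, sigma)
            if best is None or score < best[0]:
                best = (score, sigma, lam)
    return best

rng = np.random.default_rng(0)
U_nu = rng.normal(0.0, 1.0, size=(200, 1))   # synthetic numerator samples
U_de = rng.normal(0.5, 1.5, size=(200, 1))   # synthetic denominator samples
sigmas, lams = [0.3, 1.0, 3.0], [0.01, 0.1, 1.0]
# For brevity the centers are drawn from all numerator samples, including the held-out half.
centers = choose_centers(U_nu, b=20, rng=rng)
best_score, best_sigma, best_lam = select_model(U_nu, U_de, centers, sigmas, lams)
```

Replacing the single split with K folds and averaging `holdout_score` over the folds gives the cross-validation procedure the text refers to.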

3 Direct Density-ratio Estimation with Dimensionality Reduction

Although uLSIF was shown to be a useful density ratio estimation method (Kanamori
et al., 2009a), estimating the density ratio in high-dimensional spaces is still challenging.
In this section, we propose a new method of direct density-ratio estimation that involves
dimensionality reduction.

3.1  Hetero-distributional  Subspace 

Our basic idea is to first find a low-dimensional subspace in which the two densities are
significantly different from each other, and then perform density ratio estimation only
in this subspace. Although a similar framework has been explored in Sugiyama et al.
(2010a), the current formulation is substantially more general than the previous approach,
as explained below.

Let $u$ be an $m$-dimensional vector ($1 \le m \le d$) and $v$ be a $(d - m)$-dimensional vector
defined as

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} x,$$

where $U$ is an $m \times d$ matrix and $V$ is a $(d - m) \times d$ matrix. In order to ensure the
uniqueness of the decomposition, we assume (without loss of generality) that the row
vectors of $U$ and $V$ form an orthonormal basis, i.e., $U$ and $V$ correspond to “projection”
matrices that are orthogonally complementary to each other (see Figure 4). Then the two
densities $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$ can be decomposed as

$$p_{\mathrm{nu}}(x) = p_{\mathrm{nu}}(v|u)\,p_{\mathrm{nu}}(u),$$
$$p_{\mathrm{de}}(x) = p_{\mathrm{de}}(v|u)\,p_{\mathrm{de}}(u).$$

The key theoretical assumption which forms the basis of our proposed algorithm is
that the conditional densities $p_{\mathrm{nu}}(v|u)$ and $p_{\mathrm{de}}(v|u)$ agree with each other, i.e., the two
densities $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$ are decomposed as

\[ p_{\mathrm{nu}}(x) = p(v|u)\, p_{\mathrm{nu}}(u), \]
\[ p_{\mathrm{de}}(x) = p(v|u)\, p_{\mathrm{de}}(u), \]

where $p(v|u)$ is the common conditional density. This assumption implies that the
marginal densities of $u$ are different, but the conditional density of $v$ given $u$ is common
to $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$. Then the density ratio is simplified as

\[ r(x) = \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)} = \frac{p(v|u)\, p_{\mathrm{nu}}(u)}{p(v|u)\, p_{\mathrm{de}}(u)} = \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)} =: r(u). \]

Thus, the density ratio does not have to be estimated in the entire $d$-dimensional space,
but it is sufficient to estimate the ratio only in the $m$-dimensional subspace specified by $U$.
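The reduction $r(x) = r(u)$ can be checked numerically. The following is a minimal sketch under assumptions that are not part of the text: a two-dimensional Gaussian example with $m = 1$, a hypothetical rotation angle that defines $U$ and $V$, and a common conditional density $p(v|u) = N(v; u, 1)$. It compares the ratio of the full two-dimensional densities with the ratio of the marginal densities of $u$.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    diff = x - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Hypothetical orthonormal decomposition (angle chosen for illustration only).
angle = 0.3
U = np.array([[np.cos(angle), np.sin(angle)]])    # 1 x 2
V = np.array([[-np.sin(angle), np.cos(angle)]])   # 1 x 2
W = np.vstack([U, V])                             # orthogonal 2 x 2 matrix

def joint_cov(std_u):
    """Covariance of x when u ~ N(0, std_u^2) and v | u ~ N(u, 1)."""
    cov_uv = np.array([[std_u ** 2, std_u ** 2],
                       [std_u ** 2, std_u ** 2 + 1.0]])
    return W.T @ cov_uv @ W                       # x = W^T (u, v)^T

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))

# Ratio computed in the full two-dimensional space.
r_full = gauss_pdf(x, np.zeros(2), joint_cov(0.5)) / gauss_pdf(x, np.zeros(2), joint_cov(1.0))

# Ratio computed only from the marginals of u = U x.
u = x @ U.T
r_sub = (gauss_pdf(u, np.zeros(1), np.array([[0.25]]))
         / gauss_pdf(u, np.zeros(1), np.array([[1.0]])))

print(np.allclose(r_full, r_sub))
```

Because the common conditional density cancels, the two ratios agree up to floating-point error even though $v$ depends on $u$ in this example.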

Below, we use the term hetero-distributional subspace to refer to the
subspace specified by $U$ in which $p_{\mathrm{nu}}(u)$ and $p_{\mathrm{de}}(u)$ are different. More precisely, let $S$
be a subspace specified by $U$ and $V$ such that

\[ S = \{ U^\top U x \mid p_{\mathrm{nu}}(v|u) = p_{\mathrm{de}}(v|u),\ u = Ux,\ v = Vx \}. \]

Then the hetero-distributional subspace is defined as the intersection of all subspaces
$S$. Intuitively, the hetero-distributional subspace is the ‘smallest’ subspace specified by
$U$ such that $p_{\mathrm{nu}}(v|u)$ and $p_{\mathrm{de}}(v|u)$ agree with each other. We refer to the orthogonal
complement of the hetero-distributional subspace as the homo-distributional subspace (see
Figure 4).

This formulation is a generalization of the one proposed in Sugiyama et al. (2010a), in
which the components in the hetero-distributional subspace and its complementary
subspace are assumed to be independent of each other. In contrast, we do not impose
such an independence assumption in the current paper. As will be demonstrated in
Section 4.1, this generalization has a remarkable effect in extending the range of applications
of direct density-ratio estimation with dimensionality reduction.

Figure 4: Hetero-distributional subspace. Labels in the figure: homo-distributional
subspace, $p_{\mathrm{nu}}(v|u) = p_{\mathrm{de}}(v|u) = p(v|u)$; hetero-distributional subspace, $p_{\mathrm{nu}}(u) \neq p_{\mathrm{de}}(u)$.

For  the  moment,  we  assume  that  the  true  dimensionality  m  of  the  hetero-distributional 
subspace  is  known.  Later,  we  explain  how  m  is  estimated  from  data. 

3.2  Estimating  Pearson  Divergence  Using  uLSIF 

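Here, we introduce a criterion for hetero-distributional subspace search and explain how
it is estimated from data.

We use the Pearson divergence (PD) as our criterion for evaluating the discrepancy
between two distributions. PD is a squared-loss variant of the Kullback-Leibler divergence
(Kullback & Leibler, 1951), and is an instance of the f-divergences, which are also known
as the Csiszar f-divergences (Csiszar, 1967) or the Ali-Silvey distances (Ali & Silvey,
1966). PD from $p_{\mathrm{nu}}(x)$ to $p_{\mathrm{de}}(x)$ is defined and expressed as

\[ \mathrm{PD}[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)] := \frac{1}{2} \int \left( \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)} - 1 \right)^2 p_{\mathrm{de}}(x)\, dx = \frac{1}{2} \int \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}\, p_{\mathrm{nu}}(x)\, dx - \frac{1}{2}. \]

$\mathrm{PD}[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)]$ vanishes if and only if $p_{\mathrm{nu}}(x) = p_{\mathrm{de}}(x)$.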
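As a numerical illustration of the Pearson divergence, the sketch below evaluates it for two one-dimensional zero-mean Gaussians on a grid. The specific densities, the grid, and the Riemann-sum integration are assumptions made for this example only; the squared-loss form and the expectation form, $\frac{1}{2}\int (r - 1)^2 p_{\mathrm{de}}\,dx$ and $\frac{1}{2}\int r\, p_{\mathrm{nu}}\,dx - \frac{1}{2}$, are compared against each other and against the closed form available for this Gaussian pair.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (np.sqrt(2.0 * np.pi) * std)

# Illustrative choice: numerator N(0, 0.6^2), denominator N(0, 1.2^2).
s_nu, s_de = 0.6, 1.2
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
p_nu = normal_pdf(x, 0.0, s_nu)
p_de = normal_pdf(x, 0.0, s_de)
ratio = p_nu / p_de

# Squared-loss form: 0.5 * int (r - 1)^2 p_de dx (Riemann sum).
pd_squared_form = 0.5 * np.sum((ratio - 1.0) ** 2 * p_de) * dx

# Expectation form: 0.5 * int r p_nu dx - 0.5.
pd_expectation_form = 0.5 * np.sum(ratio * p_nu) * dx - 0.5

# Closed form for two zero-mean Gaussians (valid when 2 s_de^2 > s_nu^2).
pd_closed_form = 0.5 * s_de ** 2 / (s_nu * np.sqrt(2.0 * s_de ** 2 - s_nu ** 2)) - 0.5

# PD of a density with itself is zero.
pd_identical = 0.5 * np.sum((p_de / p_de - 1.0) ** 2 * p_de) * dx

print(pd_squared_form, pd_expectation_form, pd_closed_form, pd_identical)
```

The expectation form is the one that lends itself to sample-based estimation, since it only requires an average of the density ratio over numerator samples.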


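The following lemma (called the “data processing” inequality) characterizes the
hetero-distributional subspace in terms of PD.

Lemma 1 Let

\[ \mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)] = \frac{1}{2} \int \left( \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)} - 1 \right)^2 p_{\mathrm{de}}(u)\, du = \frac{1}{2} \int \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)}\, p_{\mathrm{nu}}(u)\, du - \frac{1}{2}. \tag{6} \]

Then we have

\[ \mathrm{PD}[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)] - \mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)] = \frac{1}{2} \int \left( \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)} - \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)} \right)^2 p_{\mathrm{de}}(x)\, dx \ge 0. \tag{7} \]

Figure 5: Since $\mathrm{PD}[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)]$ is constant, minimizing
$\frac{1}{2} \int \left( \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)} - \frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)} \right)^2 p_{\mathrm{de}}(x)\, dx$
is equivalent to maximizing $\mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$.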

A proof of the above lemma (for a class of f-divergences) is provided in Appendix B.
The right-hand side of Eq.(7) is non-negative, and it vanishes if and only
if $p_{\mathrm{nu}}(v|u) = p_{\mathrm{de}}(v|u)$. Since $\mathrm{PD}[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)]$ is a constant with respect to $U$,
maximizing $\mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$ with respect to $U$ leads to $p_{\mathrm{nu}}(v|u) = p_{\mathrm{de}}(v|u)$ (Figure 5).
That is, the hetero-distributional subspace can be characterized as the maximizer¹ of
$\mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$.
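The data processing inequality can be verified exactly on a small discrete example. The sketch below is an illustration only: it replaces the integrals by sums over finite probability tables (an assumption made so that everything can be computed exactly), draws two random joint tables over $(u, v)$, and checks that the gap between the full and the marginal PD is non-negative and vanishes when the conditional of $v$ given $u$ is shared.

```python
import numpy as np

def pearson_divergence(p_nu, p_de):
    """Discrete analogue of PD: 0.5 * sum((p_nu / p_de - 1)^2 * p_de)."""
    return 0.5 * np.sum((p_nu / p_de - 1.0) ** 2 * p_de)

rng = np.random.default_rng(0)

def random_joint(shape):
    """Random strictly positive probability table."""
    table = rng.random(shape) + 0.1
    return table / table.sum()

# Joint tables indexed as [u, v]: u takes 3 values, v takes 4 values.
joint_nu = random_joint((3, 4))
joint_de = random_joint((3, 4))
marg_nu = joint_nu.sum(axis=1)   # marginal of u under the numerator
marg_de = joint_de.sum(axis=1)   # marginal of u under the denominator

pd_full = pearson_divergence(joint_nu, joint_de)
pd_marg = pearson_divergence(marg_nu, marg_de)

# Discrete version of the gap term: 0.5 * sum (r(x) - r(u))^2 p_de(x).
r_x = joint_nu / joint_de
r_u = (marg_nu / marg_de)[:, None]
gap = 0.5 * np.sum((r_x - r_u) ** 2 * joint_de)

# Numerator table that shares the conditional p(v|u) with the denominator.
cond_de = joint_de / marg_de[:, None]
shared_nu = cond_de * marg_nu[:, None]
gap_shared = pearson_divergence(shared_nu, joint_de) - pd_marg

print(pd_full - pd_marg, gap, gap_shared)
```

In the shared-conditional case the full PD equals the marginal PD, which is the situation the subspace search aims to reach by maximizing the marginal PD.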

Although the hetero-distributional subspace can be characterized as the maximizer of
$\mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$, we cannot directly find the maximizer since $p_{\mathrm{nu}}(u)$ and $p_{\mathrm{de}}(u)$ are
unknown. Here, we utilize a direct density-ratio estimator, uLSIF (see Section 2.2), for
approximating $\mathrm{PD}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$ from samples. Let us replace the density ratio $p_{\mathrm{nu}}(u)/p_{\mathrm{de}}(u)$
in Eq.(6) by a density ratio estimator $\widehat{r}(u)$. Approximating the expectation over $p_{\mathrm{nu}}(u)$
by an empirical average over $\{u_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$, we have the following PD estimator:

\[ \widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)] := \frac{1}{2 n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \widehat{r}(u_i^{\mathrm{nu}}) - \frac{1}{2}. \]

Since uLSIF was shown to be consistent (i.e., the solution converges to the optimal
value) both in parametric and non-parametric cases (Kanamori et al., 2009a; Kanamori
et al., 2009b), $\widehat{\mathrm{PD}}$ would be a consistent estimator of the true PD.
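The PD estimator above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Gaussian kernel basis functions $\psi_\ell(u) = \exp(-\|u - c_\ell\|^2 / (2\sigma^2))$ centred at numerator samples, and it assumes that Eq.(5) (not shown in this section) is the regularized least-squares solution $\widehat{\alpha} = (\widehat{H} + \lambda I)^{-1} \widehat{h}$. The values of `sigma` and `lam` are fixed by hand here rather than chosen by cross-validation.

```python
import numpy as np

def gaussian_design(u, centers, sigma):
    """Matrix of basis values psi_l(u_i) = exp(-||u_i - c_l||^2 / (2 sigma^2))."""
    sq_dist = np.sum((u[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_pd(u_nu, u_de, centers, sigma, lam):
    """Return (PD estimate, alpha) from projected samples u_nu and u_de."""
    psi_nu = gaussian_design(u_nu, centers, sigma)    # n_nu x b
    psi_de = gaussian_design(u_de, centers, sigma)    # n_de x b
    h = psi_nu.mean(axis=0)                           # average over numerator samples
    H = psi_de.T @ psi_de / psi_de.shape[0]           # average over denominator samples
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    r_nu = psi_nu @ alpha                             # ratio estimate at numerator samples
    return 0.5 * r_nu.mean() - 0.5, alpha

rng = np.random.default_rng(0)

# Different distributions: numerator N(0, 0.6^2), denominator N(0, 1.2^2).
u_nu = rng.normal(0.0, 0.6, size=(1000, 1))
u_de = rng.normal(0.0, 1.2, size=(1000, 1))
pd_hat, alpha = ulsif_pd(u_nu, u_de, u_nu[:100], sigma=0.5, lam=0.1)

# Identical distributions: both sample sets come from the denominator density.
pd_null, _ = ulsif_pd(u_de[:500], u_de[500:], u_de[:100], sigma=0.5, lam=0.1)

print(pd_hat, pd_null)
```

The estimate for the pair of different distributions should be clearly larger than the one for two sample sets drawn from the same distribution, which is what makes the quantity usable as a search criterion.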

¹ As shown in Appendix B, the data processing inequality holds not only for PD, but also for any
f-divergence. Thus the characterization of the hetero-distributional subspace is not limited to PD, but
is applicable to all f-divergences.

3.3  Least-squares Hetero-distributional Subspace Search (LHSS)

Given the uLSIF-based PD estimator $\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$, our next task is to find a
maximizer of $\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$ with respect to $U$, and identify the hetero-distributional
subspace (cf. the data processing inequality given in Lemma 1). We call this procedure
Least-squares Hetero-distributional Subspace Search (LHSS).

We may employ various optimization techniques to find a maximizer of
$\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$. Here we describe several possibilities.


3.3.1  Plain  Gradient  Algorithm 


A  gradient  ascent  algorithm  would  be  a  fundamental  approach  to  non-linear  smooth 
optimization.  We  utilize  the  following  lemma. 

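Lemma 2 The gradient of $\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$ with respect to $U$ is expressed as

\[ \frac{\partial \widehat{\mathrm{PD}}}{\partial U} = \sum_{\ell=1}^{b} \widehat{\alpha}_\ell \frac{\partial \widehat{h}_\ell}{\partial U} - \frac{1}{2} \sum_{\ell,\ell'=1}^{b} \widehat{\alpha}_\ell \widehat{\alpha}_{\ell'} \frac{\partial \widehat{H}_{\ell,\ell'}}{\partial U}, \tag{8} \]

where $\widehat{\alpha}$ is given by Eq.(5) and

\[ \frac{\partial \widehat{h}_\ell}{\partial U} = \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \frac{\partial \psi_\ell(u_i^{\mathrm{nu}})}{\partial U}, \tag{9} \]

\[ \frac{\partial \widehat{H}_{\ell,\ell'}}{\partial U} = \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \left( \frac{\partial \psi_\ell(u_j^{\mathrm{de}})}{\partial U}\, \psi_{\ell'}(u_j^{\mathrm{de}}) + \psi_\ell(u_j^{\mathrm{de}})\, \frac{\partial \psi_{\ell'}(u_j^{\mathrm{de}})}{\partial U} \right), \tag{10} \]

\[ \frac{\partial \psi_\ell(u)}{\partial U} = -\frac{1}{\sigma^2} (u - c_\ell)(x - c'_\ell)^\top \psi_\ell(u). \tag{11} \]

$c'_\ell$ ($\in \mathbb{R}^d$) is a pre-image of $c_\ell$ ($\in \mathbb{R}^m$):

\[ c_\ell = U c'_\ell. \]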

A proof of the above lemma is provided in Appendix C. Note that $\{\widehat{\alpha}_\ell\}_{\ell=1}^{b}$ in Eq.(8)
depend on $U$ through $\widehat{H}$ and $\widehat{h}$ in Eq.(5), which was taken into account when deriving
the gradient (see Appendix C). A plain gradient update rule is then given as

\[ U \longleftarrow U + t\, \frac{\partial \widehat{\mathrm{PD}}}{\partial U}, \]

where $t$ ($> 0$) is a learning rate. $t$ may be chosen in practice by some approximate line
search method such as Armijo’s rule (Patriksson, 1999) or backtracking line search (Boyd
& Vandenberghe, 2004).

A naive gradient update does not necessarily fulfill the orthonormality $UU^\top = I_m$,
where $I_m$ is the $m$-dimensional identity matrix. Thus, after every gradient step, we
need to orthonormalize $U$ by, e.g., the Gram-Schmidt process (Golub & Loan, 1996) to
guarantee its orthonormality. However, this may be rather time-consuming.
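A plain gradient step of this kind can be sketched as follows. The code is an illustration under several assumptions that are not stated in this section: Gaussian kernel basis functions whose centres are projections of pre-images `c_pre`, the unclipped regularized solution $\widehat{\alpha} = (\widehat{H} + \lambda I)^{-1} \widehat{h}$ standing in for Eq.(5), hand-picked `sigma`, `lam`, and step size, and a QR factorization in place of the Gram-Schmidt process (both yield an orthonormal basis of the same row space). The gradient follows the structure of Eqs.(8)-(11), with the double sum in Eq.(8) collapsed using the symmetry of Eq.(10).

```python
import numpy as np

def basis(u, c, sigma):
    """Gaussian basis values psi_l(u_i), shape (n, b)."""
    sq = np.sum((u[:, None, :] - c[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def pd_hat(U, x_nu, x_de, c_pre, sigma, lam):
    """uLSIF-based PD estimate in the subspace given by U, plus alpha."""
    psi_nu = basis(x_nu @ U.T, c_pre @ U.T, sigma)
    psi_de = basis(x_de @ U.T, c_pre @ U.T, sigma)
    h = psi_nu.mean(axis=0)
    H = psi_de.T @ psi_de / len(x_de)
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return 0.5 * h @ alpha - 0.5, alpha

def weighted_dpsi(U, x, c_pre, sigma, weights):
    """sum_{i,l} weights[i, l] * d psi_l(u_i) / dU, using Eq.(11) for each term."""
    u, c = x @ U.T, c_pre @ U.T
    psi = basis(u, c, sigma)
    grad = np.zeros_like(U)
    for l in range(len(c_pre)):
        coeff = (weights[:, l] * psi[:, l])[:, None]            # (n, 1)
        grad -= ((u - c[l]) * coeff).T @ (x - c_pre[l]) / sigma ** 2
    return grad

def pd_gradient(U, x_nu, x_de, c_pre, sigma, lam):
    """Gradient of the PD estimate with respect to U."""
    _, alpha = pd_hat(U, x_nu, x_de, c_pre, sigma, lam)
    r_de = basis(x_de @ U.T, c_pre @ U.T, sigma) @ alpha        # ratio estimate at u_j^de
    w_nu = np.tile(alpha, (len(x_nu), 1)) / len(x_nu)           # first term of Eq.(8)
    w_de = np.outer(r_de, alpha) / len(x_de)                    # second term of Eq.(8)
    return (weighted_dpsi(U, x_nu, c_pre, sigma, w_nu)
            - weighted_dpsi(U, x_de, c_pre, sigma, w_de))

def plain_gradient_step(U, grad, t):
    """U <- U + t * grad, followed by re-orthonormalization of the rows via QR."""
    Q, _ = np.linalg.qr((U + t * grad).T)    # columns of Q are orthonormal
    return Q.T

rng = np.random.default_rng(0)
d, m, n = 3, 1, 200
x_nu = rng.normal(size=(n, d)) * np.array([0.6, 1.0, 1.0])
x_de = rng.normal(size=(n, d)) * np.array([1.2, 1.0, 1.0])
c_pre = x_nu[:10]                            # pre-images of the kernel centres
U0 = np.linalg.qr(rng.normal(size=(d, m)))[0].T
grad = pd_gradient(U0, x_nu, x_de, c_pre, sigma=1.0, lam=0.1)
U1 = plain_gradient_step(U0, grad, t=0.1)
```

A finite-difference comparison against `pd_hat` is a convenient sanity check for any hand-written gradient of this form before it is used inside a line search.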




3.3.2  Natural  Gradient  Algorithm 

In the Euclidean space, the ordinary gradient gives the steepest direction. On the
other hand, in the current setup, the matrix $U$ is restricted to be a member of the Stiefel
manifold $\mathbb{S}_m^d(\mathbb{R})$:

\[ \mathbb{S}_m^d(\mathbb{R}) := \{ U \in \mathbb{R}^{m \times d} \mid UU^\top = I_m \}. \]

On a manifold, it is known that not the ordinary gradient but the natural gradient
(Amari, 1998) gives the steepest direction. The natural gradient $\nabla \widehat{\mathrm{PD}}(U)$ at $U$ is the
projection of the ordinary gradient onto the tangent space of $\mathbb{S}_m^d(\mathbb{R})$ at $U$.

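If the tangent space is equipped with the canonical metric, i.e., for any $G$ and $G'$ in
the tangent space,

\[ \langle G, G' \rangle = \frac{1}{2} \mathrm{tr}(G^\top G'), \tag{12} \]

the natural gradient is given by

\[ \nabla \widehat{\mathrm{PD}}(U) = \frac{1}{2} \left( \frac{\partial \widehat{\mathrm{PD}}}{\partial U} - U \left( \frac{\partial \widehat{\mathrm{PD}}}{\partial U} \right)^{\!\top} U \right). \]

Then the geodesic from $U$ to the direction of the natural gradient $\nabla \widehat{\mathrm{PD}}(U)$ over $\mathbb{S}_m^d(\mathbb{R})$
can be expressed using $t \in \mathbb{R}$ as

\[ U_t := U \exp\!\left( t \left( U^\top \frac{\partial \widehat{\mathrm{PD}}}{\partial U} - \left( \frac{\partial \widehat{\mathrm{PD}}}{\partial U} \right)^{\!\top} U \right) \right), \]

where ‘exp’ for a matrix denotes the matrix exponential, i.e., for a square matrix $T$,

\[ \exp(T) := \sum_{k=0}^{\infty} \frac{1}{k!} T^k. \tag{13} \]

Thus, line search along the geodesic in the natural gradient direction is equivalent to
finding a maximizer from

\[ \{ U_t \mid t \ge 0 \}. \]

More details of the geometric structure of the Stiefel manifold can be found in Nishimori and
Akaho (2005).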

A natural gradient update rule is then given as

\[ U \longleftarrow U_t, \]

where $t$ ($> 0$) is the learning rate. Since the orthonormality of $U$ is automatically satisfied
in the natural gradient method, it would be computationally more efficient than the
plain gradient method. However, optimizing the $m \times d$ matrix $U$ is still computationally
expensive.
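The geodesic update can be illustrated with a short sketch. The gradient matrix `G` below is a random stand-in for $\partial \widehat{\mathrm{PD}} / \partial U$ (an assumption for illustration), and the matrix exponential is computed with a truncated version of the power series in Eq.(13), which is adequate for the small matrices used here but is not a production-quality routine.

```python
import numpy as np

def expm_series(T, terms=40):
    """Matrix exponential via the power series sum_k T^k / k!, truncated."""
    result = np.eye(T.shape[0])
    term = np.eye(T.shape[0])
    for k in range(1, terms):
        term = term @ T / k          # term now equals T^k / k!
        result = result + term
    return result

def natural_gradient(U, G):
    """Natural gradient on the Stiefel manifold under the canonical metric."""
    return 0.5 * (G - U @ G.T @ U)

def geodesic_step(U, G, t):
    """U_t = U exp(t (U^T G - G^T U)); the exponent is skew-symmetric."""
    A = U.T @ G - G.T @ U            # d x d skew-symmetric matrix
    return U @ expm_series(t * A)

rng = np.random.default_rng(0)
m, d = 2, 4
U = np.linalg.qr(rng.normal(size=(d, m)))[0].T   # random m x d with orthonormal rows
G = rng.normal(size=(m, d))                      # stand-in for the PD gradient
U_t = geodesic_step(U, G, t=0.5)

print(np.allclose(U_t @ U_t.T, np.eye(m)))
```

Since the exponential of a skew-symmetric matrix is orthogonal, the rows of `U_t` stay orthonormal for every `t`, so no separate re-orthonormalization step is needed; for small `t` the displacement `U_t - U` is proportional to the natural gradient.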




Figure 6: In the hetero-distributional subspace search, only rotation which changes the
subspace matters (the solid arrow); rotation within the subspace (dotted arrow) can be
ignored since this does not change the subspace. Similarly, rotation within the orthogonal
complement of the hetero-distributional subspace can also be ignored (not depicted in the
figure).


3.3.3  Givens  Rotation 

Another simple strategy for optimizing $U$ is to rotate the matrix in the plane spanned
by two coordinate axes (which is called the Givens rotation; see Golub & Loan, 1996).
That is, we randomly choose a two-dimensional subspace spanned by the $i$-th and $j$-th
variables, and rotate the matrix $U$ within this subspace:

\[ U \longleftarrow R_\theta^{(i,j)} U, \]

where $R_\theta^{(i,j)}$ is the rotation matrix by angle $\theta$ within the subspace spanned by the $i$-th
and $j$-th variables. $R_\theta^{(i,j)}$ is equal to the identity matrix except that its elements $(i,i)$,
$(i,j)$, $(j,i)$, and $(j,j)$ form a two-dimensional rotation matrix:

\[ \begin{bmatrix} [R_\theta^{(i,j)}]_{i,i} & [R_\theta^{(i,j)}]_{i,j} \\ [R_\theta^{(i,j)}]_{j,i} & [R_\theta^{(i,j)}]_{j,j} \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}. \]

The rotation angle $\theta$ ($0 \le \theta \le \pi$) may be optimized by some secant method (Press et al.,
1992).

As  shown  above,  the  update  rule  of  the  Givens  rotations  is  computationally  very 
efficient.  However,  since  the  update  direction  is  not  optimized  as  in  the  plain/natural 
gradient  methods,  the  Givens-rotation  method  could  be  potentially  less  efficient  as  an 
optimization  strategy. 
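A Givens-rotation update can be sketched as follows. Several choices here are assumptions made for illustration rather than details taken from the text: the rotation is applied to the stacked $d \times d$ matrix formed by $U$ and $V$ so that one row of $U$ is rotated against one row of $V$, the angle is chosen by a coarse grid search standing in for a secant method, and a simple toy objective stands in for the PD estimate.

```python
import numpy as np

def givens_rotation(size, i, j, theta):
    """Identity matrix except for a two-dimensional rotation in the (i, j) plane."""
    R = np.eye(size)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i], R[i, j] = c, s
    R[j, i], R[j, j] = -s, c
    return R

def rotate_subspace(U, V, i, j, theta):
    """Rotate row i of U against row j of V and return the new (U, V)."""
    m = U.shape[0]
    W = np.vstack([U, V])                                  # d x d orthogonal matrix
    W = givens_rotation(W.shape[0], i, m + j, theta) @ W
    return W[:m], W[m:]

def best_angle(U, V, i, j, objective, n_grid=181):
    """Coarse grid search over theta in [0, pi] (stand-in for a secant method)."""
    thetas = np.linspace(0.0, np.pi, n_grid)
    values = [objective(rotate_subspace(U, V, i, j, th)[0]) for th in thetas]
    return thetas[int(np.argmax(values))]

# Toy objective standing in for the PD estimate: squared length of the
# projection of a fixed direction onto the subspace spanned by the rows of U.
target = np.array([np.cos(0.7), np.sin(0.7), 0.0])
objective = lambda U_mat: float(np.sum((U_mat @ target) ** 2))

U = np.array([[1.0, 0.0, 0.0]])
V = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
theta = best_angle(U, V, 0, 0, objective)
U_new, V_new = rotate_subspace(U, V, 0, 0, theta)

print(theta, objective(U_new))
```

Each update touches only two rows, which is what makes a single step cheap; the price is that the rotation plane is chosen without reference to the gradient.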


3.3.4  Subspace  Rotation 

Since we are searching for a subspace, rotation within the subspace does not have any
influence on the objective value $\widehat{\mathrm{PD}}$ (see Figure 6). This implies that the number of
parameters to be optimized in the gradient algorithm can be reduced.

For a skew-symmetric matrix $M$ ($\in \mathbb{R}^{d \times d}$), i.e., $M^\top = -M$, rotation of $U$ can be
expressed as follows (Plumbley, 2005):

\[ \begin{bmatrix} I_m & O_{m,(d-m)} \end{bmatrix} \exp(M) \begin{bmatrix} U \\ V \end{bmatrix}, \]

where $O_{d,d'}$ is the $d \times d'$ matrix with all zeros, and $\exp(M)$ is the matrix exponential of
$M$ (see Eq.(13)). $M = O_{d,d}$ (i.e., $\exp(O_{d,d}) = I_d$) corresponds to no rotation. Here we
update $U$ through the matrix $M$.

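Let us adopt Eq.(12) as the inner product in the space of skew-symmetric matrices.
Then we have the following lemma.

Lemma 3 The derivative of $\widehat{\mathrm{PD}}$ with respect to $M$ at $M = O_{d,d}$ is given by

\[ \left. \frac{\partial \widehat{\mathrm{PD}}}{\partial M} \right|_{M = O_{d,d}} = \begin{bmatrix} O_{m,m} & \dfrac{\partial \widehat{\mathrm{PD}}}{\partial U} V^\top \\ -\left( \dfrac{\partial \widehat{\mathrm{PD}}}{\partial U} V^\top \right)^{\!\top} & O_{(d-m),(d-m)} \end{bmatrix}. \tag{14} \]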


A proof of the above lemma is provided in Appendix D. The block structure of Eq.(14)
has an intuitive explanation: the non-zero off-diagonal blocks correspond to the rotation
angles between the hetero-distributional subspace and its orthogonal complement, which
do affect the objective function $\widehat{\mathrm{PD}}$. On the other hand, the derivative of rotation within
the two subspaces vanishes because this does not change the objective value. Thus the
variables to be optimized are only the angles corresponding to the non-zero off-diagonal
blocks, which include only $m(d-m)$ variables. In contrast, the plain/natural
gradient algorithms optimize the matrix $U$, which contains $md$ variables. Thus, when $m$
is large, the subspace rotation approach may be computationally more efficient than the
plain/natural gradient algorithms.

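The gradient ascent update rule of $M$ is given by

\[ M \longleftarrow t \left. \frac{\partial \widehat{\mathrm{PD}}}{\partial M} \right|_{M = O_{d,d}}, \]

where $t$ is a step-size. Then $U$ is updated as

\[ U \longleftarrow \begin{bmatrix} I_m & O_{m,(d-m)} \end{bmatrix} \exp(M) \begin{bmatrix} U \\ V \end{bmatrix}. \]

The conjugate gradient method (Golub & Loan, 1996) may be used for the update of $M$.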
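One subspace-rotation step can be sketched as below. The sketch makes some assumptions for illustration: a toy objective with a known gradient stands in for the PD estimate, the step size is fixed by hand, and the matrix exponential is computed with a truncated power series. It builds the block matrix of Eq.(14) from the gradient with respect to $U$, takes a plain gradient ascent step in $M$, and reads off the new $U$ as the first $m$ rows of the rotated stack. Returning the remaining rows as a candidate complement basis is a convenience of this sketch.

```python
import numpy as np

def expm_series(T, terms=40):
    """Matrix exponential via a truncated power series (small matrices only)."""
    result = np.eye(T.shape[0])
    term = np.eye(T.shape[0])
    for k in range(1, terms):
        term = term @ T / k
        result = result + term
    return result

def rotation_gradient(G, V):
    """Block matrix of Eq.(14), given G = d(objective)/dU and the complement V."""
    m, d = G.shape
    B = G @ V.T                      # m x (d - m) off-diagonal block
    dM = np.zeros((d, d))
    dM[:m, m:] = B
    dM[m:, :m] = -B.T
    return dM

def subspace_rotation_step(U, V, G, t):
    """M <- t * dM, then rotate the stacked matrix and split it into (U, V)."""
    m = U.shape[0]
    M = t * rotation_gradient(G, V)
    W = expm_series(M) @ np.vstack([U, V])
    return W[:m], W[m:]

rng = np.random.default_rng(0)
d, m = 4, 2
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
U, V = Q[:m], Q[m:]                  # orthogonally complementary row bases

# Toy objective standing in for the PD estimate: ||U a||^2 for a fixed unit vector a.
a = rng.normal(size=d)
a /= np.linalg.norm(a)
objective = lambda U_mat: float(np.sum((U_mat @ a) ** 2))
G = 2.0 * np.outer(U @ a, a)         # gradient of the toy objective with respect to U

U_new, V_new = subspace_rotation_step(U, V, G, t=0.1)
print(objective(U), objective(U_new))
```

Only the $m(d-m)$ entries of the off-diagonal block carry information, which reflects the parameter reduction discussed above.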

Following the update of $U$, its counterpart $V$ also needs to be updated accordingly
since the hetero-distributional subspace and its complement specified by $U$ and $V$ should
be orthogonal to each other (see Figure 4). This can be achieved by setting

\[ V \longleftarrow [\varphi_1 \,|\, \varphi_2 \,|\, \cdots \,|\, \varphi_{d-m}]^\top, \]

where $\varphi_1, \varphi_2, \ldots, \varphi_{d-m}$ are orthonormal basis vectors in the orthogonal complement of
the hetero-distributional subspace.
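One way to obtain such basis vectors is from the singular value decomposition of $U$. The helper below is a hypothetical illustration (its name and the use of the SVD are choices made here, not taken from the text); it assumes that $U$ has full row rank.

```python
import numpy as np

def orthogonal_complement(U):
    """Rows form an orthonormal basis of the orthogonal complement of U's row space."""
    m, d = U.shape
    # With full_matrices=True, vt is d x d; its last d - m rows span the null space of U.
    _, _, vt = np.linalg.svd(U, full_matrices=True)
    return vt[m:]

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(5, 2)))[0].T   # 2 x 5 with orthonormal rows
V = orthogonal_complement(U)

print(V.shape, np.allclose(U @ V.T, 0.0))
```

Any orthonormal basis of the complement serves equally well, since rotation within the complement does not affect the objective.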
3.4  Proposed Algorithm: D3-LHSS

Finally,  we  estimate  the  density  ratio  in  the  hetero-distributional  subspace  detected  by 
the  above  LHSS  method. 

A notable fact of the LHSS algorithm is that the density ratio estimator in the
hetero-distributional subspace has already been obtained during the hetero-distributional
subspace search procedure. Thus, we do not need an additional estimation procedure; our
final solution is simply given by

\[ \widehat{r}(x) = \sum_{\ell=1}^{b} \widehat{\alpha}_\ell\, \psi_\ell(\widehat{U} x), \]

where $\widehat{U}$ is a projection matrix obtained by the LHSS algorithm. $\{\widehat{\alpha}_\ell\}_{\ell=1}^{b}$ are the learned
parameters for $\widehat{U}$, which have been obtained and used when computing the gradient (see
Lemma 2).

This  expression  implies  that  if  the  dimensionality  is  not  reduced  (i.e.,  m  =  d),  the 
proposed  method  agrees  with  the  original  uLSIF  (see  Section  2.2).  Thus,  the  proposed 
method  could  be  regarded  as  a  natural  extension  of  uLSIF  to  high-dimensional  data. 

Given the true dimensionality $m$ of the hetero-distributional subspace, we can estimate
the hetero-distributional subspace by the LHSS algorithm. When $m$ is unknown, we may
choose the best dimensionality based on the CV score of the uLSIF estimator. We refer
to our proposed procedure as D3-LHSS (D-cube LHSS; Direct Density-ratio estimation with
Dimensionality reduction via Least-squares Hetero-distributional Subspace Search).

The complete procedure of D3-LHSS is summarized in Figure 7. A MATLAB® implementation
of D3-LHSS is available from

‘http://sugiyama-www.cs.titech.ac.jp/~sugi/software/D3LHSS/’.


4  Experiments 

In this section, we investigate the experimental performance of the proposed method.
We employ the subspace rotation algorithm explained in Section 3.3.4 in our D3-LHSS
implementation. In uLSIF, the number of parameters is fixed to $b = 100$; the Gaussian
width $\sigma$ and the regularization parameter $\lambda$ are chosen based on cross-validation.

4.1  Illustrative  Examples 

First,  we  illustrate  how  the  D3-LHSS  algorithm  behaves. 

As explained in Section 1, the previous D3 method, D3-LFDA/uLSIF (Sugiyama et al.,
2010a), has two potential weaknesses:

•  The component $u$ inside the hetero-distributional subspace and its complementary
component $v$ are assumed to be statistically independent (cf. Section 3.1).

•  Separability of samples drawn from two distributions implies that the two
distributions are different, but non-separability does not necessarily imply that the two
distributions are equivalent. Thus, D3-LFDA/uLSIF may not be able to detect a
subspace in which the two distributions are different but the samples are not really
separable.

Input: Two sets of samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$ on $\mathbb{R}^d$
Output: Density ratio estimator $\widehat{r}(x)$

For each reduced dimension $m = 1, 2, \ldots, d$
    Initialize embedding matrix $U_m$ ($\in \mathbb{R}^{m \times d}$);
    Repeat until $U_m$ converges
        Choose Gaussian width $\sigma$ and regularization parameter $\lambda$ by CV;
        Update $U_m$ by some optimization method (see Section 3.3);
    end
    Obtain embedding matrix $\widehat{U}_m$ and corresponding density-ratio estimator $\widehat{r}_m(x)$;
    Compute its CV value as a function of $m$;
end
Choose the best reduced dimensionality $\widehat{m}$ that minimizes the CV score;
Set $\widehat{r}(x) = \widehat{r}_{\widehat{m}}(x)$;

Figure 7: Pseudo code of D3-LHSS.

Here, through numerical examples, we illustrate these weaknesses of D3-LFDA/uLSIF,
and show that these problems can be overcome by D3-LHSS. Let us consider two-dimensional
examples (i.e., $d = 2$), and suppose that the two densities $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$ are different
only in the one-dimensional subspace (i.e., $m = 1$) spanned by $(1, 0)^\top$:

\[ x = (x^{(1)}, x^{(2)})^\top = (u, v)^\top, \]
\[ p_{\mathrm{nu}}(x) = p(v|u)\, p_{\mathrm{nu}}(u), \]
\[ p_{\mathrm{de}}(x) = p(v|u)\, p_{\mathrm{de}}(u). \]

Let $n_{\mathrm{nu}} = n_{\mathrm{de}} = 1000$. We use the following three datasets:

“Rather-separate” dataset (Figure 8):

\[ p(v|u) = p(v) = N(v; 0, 1^2), \]
\[ p_{\mathrm{nu}}(u) = N(u; 0, 0.5^2), \]
\[ p_{\mathrm{de}}(u) = 0.5\, N(u; -1, 1^2) + 0.5\, N(u; 1, 1^2), \]

where $N(u; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$ with
respect to $u$. This is an easy and simple dataset for the purpose of illustrating the
usefulness of dimensionality reduction in density ratio estimation.

“Highly-overlapped” dataset (Figure 9):

\[ p(v|u) = p(v) = N(v; 0, 1^2), \]
\[ p_{\mathrm{nu}}(u) = N(u; 0, 0.6^2), \]
\[ p_{\mathrm{de}}(u) = N(u; 0, 1.2^2). \]

Since $v$ is independent of $u$, D3-LFDA/uLSIF is still applicable in principle. However,
$u^{\mathrm{nu}}$ and $u^{\mathrm{de}}$ are highly overlapped and are not clearly separable. Thus this
dataset would be hard for D3-LFDA/uLSIF.

“Dependent” dataset (Figure 10):

\[ p(v|u) = N(v; u, 1^2), \]
\[ p_{\mathrm{nu}}(u) = N(u; 0, 0.5^2), \]
\[ p_{\mathrm{de}}(u) = 0.5\, N(u; -1, 1^2) + 0.5\, N(u; 1, 1^2). \]

In this dataset, the conditional distribution $p(v|u)$ is common, but the marginal
distributions $p_{\mathrm{nu}}(v)$ and $p_{\mathrm{de}}(v)$ are different. Since $v$ is not independent of $u$, this
dataset would be out of scope for D3-LFDA/uLSIF.
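The three datasets can be generated with a few lines of code. The following sampler is a sketch written for this illustration (the function name and interface are hypothetical); it draws $u$ from the stated marginals, then draws $v$ from the common conditional density, and returns the samples with columns $(u, v)$.

```python
import numpy as np

def sample_toy_dataset(name, n, rng):
    """Draw (x_nu, x_de), each of shape (n, 2) with columns (u, v)."""
    if name not in ("rather-separate", "highly-overlapped", "dependent"):
        raise ValueError(f"unknown dataset: {name}")

    if name == "highly-overlapped":
        u_nu = rng.normal(0.0, 0.6, size=n)
        u_de = rng.normal(0.0, 1.2, size=n)
    else:
        # "rather-separate" and "dependent" share the same marginals of u.
        u_nu = rng.normal(0.0, 0.5, size=n)
        component_means = rng.choice([-1.0, 1.0], size=n)   # equal-weight mixture
        u_de = rng.normal(component_means, 1.0)

    def draw_v(u):
        # Common conditional p(v|u): N(v; u, 1) if dependent, N(v; 0, 1) otherwise.
        mean = u if name == "dependent" else np.zeros_like(u)
        return rng.normal(mean, 1.0)

    x_nu = np.column_stack([u_nu, draw_v(u_nu)])
    x_de = np.column_stack([u_de, draw_v(u_de)])
    return x_nu, x_de

rng = np.random.default_rng(0)
x_nu_dep, x_de_dep = sample_toy_dataset("dependent", 1000, rng)
x_nu_sep, x_de_sep = sample_toy_dataset("rather-separate", 1000, rng)

# In the "dependent" dataset u and v are strongly correlated; in the others they are not.
print(np.corrcoef(x_de_dep[:, 0], x_de_dep[:, 1])[0, 1])
print(np.corrcoef(x_de_sep[:, 0], x_de_sep[:, 1])[0, 1])
```

In all three cases the true hetero-distributional subspace is the first coordinate axis, so a recovered $U$ can be compared directly with $(1, 0)$.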

The true hetero-distributional subspace for the “rather-separate” dataset is depicted
by the dotted line in Figure 8(a); the solid line and the dashed line depict the
hetero-distributional subspace found by LHSS and LFDA with reduced dimensionality $m = 1$,
respectively. This graph shows that LHSS and LFDA both give very good estimates of
the true hetero-distributional subspace. In Figure 8(c), Figure 8(d), and Figure 8(e),
density ratio functions estimated by the plain uLSIF without dimensionality reduction,
D3-LFDA/uLSIF, and D3-LHSS for the “rather-separate” dataset are depicted. These
graphs show that both D3-LHSS and D3-LFDA/uLSIF give much better estimates of the
density ratio function (see Figure 8(b) for the profile of the true density ratio function)
than the plain uLSIF without dimensionality reduction. Thus, the usefulness of
dimensionality reduction in density ratio estimation was illustrated.

For the “highly-overlapped” dataset (Figure 9), LHSS gives a reasonable estimate of
the hetero-distributional subspace, while LFDA is highly erroneous due to the low separability.
As a result, the density ratio function obtained by D3-LFDA/uLSIF does not reflect the
true redundant structure appropriately. On the other hand, D3-LHSS still works well.

Finally, for the “dependent” dataset (Figure 10), LHSS gives an accurate estimate of
the hetero-distributional subspace. However, LFDA gives a highly biased solution since
the marginal distributions $p_{\mathrm{nu}}(v)$ and $p_{\mathrm{de}}(v)$ are no longer common in the “dependent”
dataset. Consequently, the density ratio function obtained by D3-LFDA/uLSIF is highly
erroneous. In contrast, D3-LHSS still works very well for the “dependent” dataset.

The experimental results for the “highly-overlapped” and “dependent” datasets
illustrated typical failure modes of LFDA, and LHSS was shown to be able to successfully
overcome these weaknesses of LFDA.

Figure 8: “Rather-separate” dataset. Panels: (a) hetero-distributional subspace;
(b) true density ratio $r(x)$; (c) $\widehat{r}(x)$ by plain uLSIF; (d) $\widehat{r}(x)$ by D3-LFDA/uLSIF;
(e) $\widehat{r}(x)$ by D3-LHSS.

Figure 9: “Highly-overlapped” dataset. Surviving panel titles: (a) hetero-distributional
subspace; (d) $\widehat{r}(x)$ by D3-LFDA/uLSIF; (e) $\widehat{r}(x)$ by D3-LHSS.

Figure 10: “Dependent” dataset. Surviving panel titles: (a) hetero-distributional
subspace; (d) $\widehat{r}(x)$ by D3-LFDA/uLSIF; (e) $\widehat{r}(x)$ by D3-LHSS. Legend of panel (a):
samples $x^{\mathrm{de}}$ (×) and $x^{\mathrm{nu}}$ (○); true subspace (dotted line); LHSS subspace (solid line);
LFDA subspace (dashed line).

4.2  Evaluation on Artificial Data

Next, we systematically compare the performance of the proposed D3-LHSS with that of
the plain uLSIF and D3-LFDA/uLSIF for high-dimensional artificial data.

For the three datasets used in the previous experiments, we increase the entire
dimensionality as $d = 2, 3, \ldots, 10$ by adding dimensions consisting of standard normal noise.
The dimensionality of the hetero-distributional subspace is estimated based on the CV
score of uLSIF. We evaluate the error of a density ratio estimator $\widehat{r}(x)$ by

\[ \mathrm{Error} := \frac{1}{2} \int \left( \widehat{r}(x) - r(x) \right)^2 p_{\mathrm{de}}(x)\, dx, \tag{15} \]

which uLSIF tries to minimize (see Section 2.2).
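When the true ratio is known, as for the artificial datasets, this error can be approximated by an average over samples drawn from $p_{\mathrm{de}}$. The sketch below uses such a Monte Carlo approximation, which is an assumption of this illustration (the text does not state how the integral was evaluated), on the “highly-overlapped” dataset. As a reference point it evaluates the trivial estimator $\widehat{r}(x) = 1$, whose error equals the Pearson divergence between the two densities.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (np.sqrt(2.0 * np.pi) * std)

def true_ratio(x):
    """True r(x) for the "highly-overlapped" dataset; depends on x[:, 0] only."""
    u = x[:, 0]
    return normal_pdf(u, 0.0, 0.6) / normal_pdf(u, 0.0, 1.2)

def ratio_error(r_hat, r_true, x_de):
    """Monte Carlo approximation of the error, using samples x_de drawn from p_de."""
    return 0.5 * np.mean((r_hat(x_de) - r_true(x_de)) ** 2)

rng = np.random.default_rng(0)
n = 100000
x_de = np.column_stack([rng.normal(0.0, 1.2, n), rng.normal(0.0, 1.0, n)])

constant_one = lambda x: np.ones(len(x))     # trivial estimator r_hat(x) = 1
err_constant = ratio_error(constant_one, true_ratio, x_de)
err_oracle = ratio_error(true_ratio, true_ratio, x_de)

# Closed-form PD between N(0, 0.6^2) and N(0, 1.2^2) for comparison.
pd_closed_form = 0.5 * 1.2 ** 2 / (0.6 * np.sqrt(2.0 * 1.2 ** 2 - 0.6 ** 2)) - 0.5

print(err_constant, pd_closed_form, err_oracle)
```

Any useful estimator should achieve an error below that of the constant estimator, which gives a simple baseline for reading the error curves.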

The  left  graphs  in  Figure  11  show  the  density-ratio  estimation  error  averaged  over 
100  runs  as  functions  of  the  entire  input  dimensionality  d.  The  best  method  in  terms 
of  the  mean  error  and  comparable  methods  according  to  the  t-test  (Henkel,  1979)  at  the 
significance  level  1%  are  specified  by  ‘o’;  otherwise  methods  are  specified  by  ‘x’. 

These plots show that, while the error of the plain uLSIF increases rapidly as the entire
dimensionality $d$ increases, that of the proposed D3-LHSS is kept moderate. Consequently,
the proposed method consistently outperforms the plain uLSIF. D3-LHSS is comparable
to D3-LFDA/uLSIF for the “rather-separate” dataset, and D3-LHSS significantly
outperforms D3-LFDA/uLSIF for the “highly-overlapped” and “dependent” datasets. Thus,
D3-LHSS was overall shown to compare favorably with the other approaches.

The choice of the dimensionality of the hetero-distributional subspace in D3-LHSS
and D3-LFDA/uLSIF is illustrated in the middle and right columns of Figure 11; the
darker the color is, the more frequently the corresponding dimensionality is chosen. The
plots show that D3-LHSS reasonably identifies the true dimensionality ($m = 1$ in the
current setup) for all the three datasets, while D3-LFDA/uLSIF performs well only for the
“rather-separate” dataset. This happened because D3-LFDA/uLSIF cannot find
appropriate low-dimensional subspaces for the “highly-overlapped” and “dependent” datasets,
and therefore the CV scores misled the choice of subspace dimensionality.


Figure 11: Top: “Rather-separate” dataset. Middle: “Highly-overlapped” dataset.
Bottom: “Dependent” dataset. Left: Density-ratio estimation error (15) averaged over 100
runs as a function of the entire data dimensionality $d$. The best method in terms of the
mean error and comparable methods according to the t-test at the significance level 1%
are specified by ‘o’; otherwise methods are specified by ‘x’. Center: The dimensionality
of the hetero-distributional subspace chosen by CV in LHSS. Right: The dimensionality
of the hetero-distributional subspace chosen by CV in LFDA. Panel titles: (a), (d), (g)
density-ratio estimation error; (b), (e), (h) choice of dimensionality by D3-LHSS;
(c), (f), (i) choice of dimensionality by D3-LFDA/uLSIF. Horizontal axis: entire
dimensionality $d$.

4.3  Inlier-based Outlier Detection for Benchmark Data

Finally, we apply the proposed method to inlier-based outlier detection, i.e., finding
outliers in an evaluation dataset based on another “model” dataset that only contains inliers
(see Section A.2 for details).

We use the USPS hand-written digit dataset taken from the UCI Machine Learning
Repository (Asuncion & Newman, 2007). We regard samples in the class ‘1’ as inliers
and samples in other classes as outliers. We randomly take 500 samples from the class
‘1’, and assign them to the model dataset. Then we randomly take 500 samples from
the class ‘1’ without overlap, and 25 samples from one of the other classes, to form the
evaluation dataset. From these samples, density ratio estimation is performed and the
outlier score is computed. Since the USPS hand-written digit dataset contains 10 classes
(i.e., from ‘0’ to ‘9’), we have 9 different tasks in total. The dimensionality of the samples
is $d = 256$. For the D3-LHSS and D3-LFDA/uLSIF methods, we choose the dimensionality
of the hetero-distributional subspace from $m = 1, 2, \ldots, 5$ by cross-validation.

When evaluating the performance of outlier detection methods, it is important to take
into account both the detection rate (i.e., the amount of true outliers an outlier detection
algorithm can find) and the detection accuracy (i.e., the amount of true inliers an outlier
detection algorithm misjudges as outliers). Since there is a trade-off between the detection
rate and the detection accuracy, we adopt the area under the ROC curve (AUC) as our
error metric (Bradley, 1997).
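For reference, the AUC can be computed directly from outlier scores through its rank interpretation: it is the probability that a randomly chosen outlier receives a higher outlier score than a randomly chosen inlier, with ties counted as one half. The function below is a generic sketch of this computation; how the outlier score is derived from the estimated density ratio is described in Section A.2 and is not assumed here.

```python
import numpy as np

def auc_score(outlier_scores, is_outlier):
    """AUC via pairwise comparison of outlier scores (ties count one half)."""
    scores = np.asarray(outlier_scores, dtype=float)
    labels = np.asarray(is_outlier, dtype=bool)
    pos, neg = scores[labels], scores[~labels]        # outliers, inliers
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Two inliers (label 0) and two outliers (label 1) with illustrative scores.
example_auc = auc_score([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example_auc)   # 3 of the 4 outlier-inlier pairs are ordered correctly: 0.75
```

A value of 1 means that every outlier is ranked above every inlier, and a value of 0.5 corresponds to a random ranking.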

The mean and standard deviation of AUC scores over 100 runs with different random
seeds are summarized in Table 1, where the best method in terms of the mean AUC
and comparable methods according to the t-test at the significance level 1% are specified
by ‘°’. The table shows that the proposed D3-LHSS tends to outperform the plain
uLSIF and D3-LFDA/uLSIF. It is also noteworthy that D3-LFDA/uLSIF is actually
outperformed by the plain uLSIF, the baseline method. This is perhaps because the
numerator and denominator datasets are highly overlapped in outlier detection scenarios,
so D3-LFDA/uLSIF performs rather poorly (cf. Figure 9).

We  also  evaluate  the  performance  of  each  method  for  an  additional  test  dataset  which 
is  not  used  for  density  ratio  estimation.  The  test  dataset  consists  of  100  randomly  chosen 
samples  from  the  class  ‘1’  and  5  randomly  chosen  samples  from  the  outlier  class  (which 
is  the  same  as  the  evaluation  dataset).  The  results  are  summarized  in  Table  2,  showing 
that  the  advantage  of  the  proposed  method  is  still  valid  in  this  more  challenging  scenario. 


5  Conclusions 

Density ratios are becoming quantities of interest in the machine learning and data mining
communities since they can be used for solving various important data processing tasks such
as non-stationarity adaptation, outlier detection, and feature selection (Sugiyama et al.,
2009; Sugiyama et al., 2011). In this paper, we tackled a challenging problem of estimating
density ratios in high-dimensional spaces, and gave a new procedure in the framework of
Direct Density-ratio estimation with Dimensionality reduction (D3; D-cube). The basic
idea of D3 is to identify a subspace called the hetero-distributional subspace, in which two
distributions (corresponding to the numerator and denominator of the density ratio) are
different.

In the existing approach of D3 (Sugiyama et al., 2010a), the hetero-distributional
subspace is identified by finding a subspace in which samples drawn from the two distributions
are maximally separated from each other. To this end, supervised dimensionality
reduction methods such as local Fisher discriminant analysis (LFDA) (Sugiyama, 2007) are
utilized. This approach was shown to work well when the components inside and outside
the hetero-distributional subspace are statistically independent, and samples drawn from
the two distributions are highly separable from each other in the hetero-distributional
subspace.

Table 1: Outlier detection for the USPS hand-written digit dataset ($d = 256$). The
means (and standard deviations in parentheses) of AUC scores over 100 runs for the
evaluation dataset are summarized. The best method in terms of the mean AUC value
and comparable methods according to the t-test at the significance level 1% are specified
by ‘°’. The means (and standard deviations in parentheses) of the chosen dimensionality
by cross-validation are also included in the table.

           D3-LHSS                      D3-LFDA/uLSIF                Plain uLSIF
Data       AUC              m           AUC              m           AUC
Digit 2    °0.956 (0.035)   4.3 (0.8)    0.889 (0.104)   1.7 (1.1)    0.902 (0.038)
Digit 3    °0.967 (0.032)   4.4 (0.8)    0.868 (0.136)   1.8 (1.1)    0.921 (0.039)
Digit 4    °0.907 (0.061)   4.4 (0.9)    0.825 (0.104)   1.4 (0.6)    0.870 (0.036)
Digit 5    °0.965 (0.037)   4.3 (0.9)    0.882 (0.109)   1.6 (0.9)    0.906 (0.037)
Digit 6    °0.974 (0.022)   4.4 (0.8)    0.891 (0.090)   1.7 (1.1)    0.941 (0.029)
Digit 7    °0.924 (0.072)   4.4 (0.9)    0.642 (0.139)   2.3 (1.4)    0.878 (0.035)
Digit 8    °0.929 (0.051)   4.2 (1.0)    0.804 (0.147)   1.8 (1.1)    0.860 (0.033)
Digit 9    °0.942 (0.048)   4.6 (0.7)    0.790 (0.136)   1.8 (1.1)    0.892 (0.035)
Digit 0    °0.986 (0.019)   4.2 (0.9)    0.920 (0.071)   1.9 (0.8)   °0.979 (0.019)
Average     0.950 (0.051)   4.4 (0.9)    0.835 (0.142)   1.8 (1.1)    0.905 (0.049)

Table 2: Outlier detection for the USPS hand-written digit dataset (d = 256). The means
(with standard deviations in parentheses) of AUC scores over 100 runs for the unlearned test
dataset are summarized.

            D3-LHSS                      D3-LFDA/uLSIF                Plain uLSIF
Data        AUC              m           AUC              m           AUC
Digit 2     °0.946 (0.047)   4.3 (0.8)   0.817 (0.132)    1.7 (1.1)   0.905 (0.044)
Digit 3     °0.953 (0.061)   4.4 (0.8)   0.780 (0.161)    1.8 (1.1)   0.924 (0.045)
Digit 4     °0.880 (0.094)   4.4 (0.9)   0.767 (0.121)    1.4 (0.6)   °0.870 (0.063)
Digit 5     °0.954 (0.057)   4.3 (0.9)   0.813 (0.142)    1.6 (0.9)   0.906 (0.047)
Digit 6     °0.959 (0.052)   4.4 (0.8)   0.806 (0.141)    1.7 (1.1)   0.939 (0.040)
Digit 7     °0.909 (0.079)   4.4 (0.9)   0.689 (0.173)    2.3 (1.4)   0.877 (0.056)
Digit 8     °0.903 (0.078)   4.2 (1.0)   0.741 (0.173)    1.8 (1.1)   0.861 (0.049)
Digit 9     °0.932 (0.072)   4.6 (0.7)   0.793 (0.128)    1.8 (1.1)   0.894 (0.054)
Digit 0     °0.982 (0.039)   4.2 (0.9)   0.859 (0.098)    1.9 (0.8)   °0.982 (0.022)
Average     0.935 (0.073)    4.4 (0.9)   0.785 (0.150)    1.8 (1.1)   0.906 (0.060)


However, as illustrated in Section 4.1, violation of these conditions can cause significant
performance degradation. This problem can be overcome in principle by finding a
subspace such that two conditional distributions are similar to each other in its complementary
subspace. However, comparing conditional distributions is a cumbersome task.
To cope with this problem, we first proved that the hetero-distributional subspace can
be characterized as the subspace in which two marginal distributions are maximally different
under the Pearson divergence (Lemma 1). Based on this lemma, we proposed a
new algorithm for finding the hetero-distributional subspace called Least-squares Hetero-distributional
Subspace Search (LHSS). Since a density-ratio estimation method is utilized
during hetero-distributional subspace search in the LHSS procedure, an additional
density-ratio estimation step is not needed after hetero-distributional subspace search.
Thus, two steps in the previous method (hetero-distributional subspace search followed
by density ratio estimation in the identified subspace) were merged into a single step (see
Figure 2). Experiments showed that the proposed single-shot procedure, D3-LHSS (D-cube LHSS),
overcomes the limitations of the D3-LFDA/uLSIF approach.

In the experiments in Section 4, we employed the subspace rotation algorithm explained
in Section 3.3.4 in our D3-LHSS implementation. Although we experimentally
found that the subspace rotation algorithm is useful, this does not necessarily mean that
subspace rotation is always the best performing algorithm. Other approaches explained in
Section 3.3 may also be useful in some situations. Further investigation of the optimization
issue is important future work.

We gave a general proof of the data processing inequality (Lemma 1) for a class of
f-divergences (Ali & Silvey, 1966; Csiszár, 1967). Thus, the hetero-distributional subspace
is characterized not only by the Pearson divergence, but also by any f-divergence. Since
a framework of density ratio estimation for f-divergences has been provided in Nguyen
et al. (2010), an interesting future direction is to develop hetero-distributional subspace
search methods for general f-divergences.
A Usage of Density Ratios in Data Processing

We are interested in estimating density ratios since they are useful in various data processing
tasks. Here, we briefly review possible usage of density ratios (Sugiyama et al.,
2009; Sugiyama et al., 2011).

A.1 Covariate Shift Adaptation

Covariate shift (Shimodaira, 2000) is a situation in supervised learning where input distributions
change between the training and test phases, but the conditional distribution of
outputs given inputs remains unchanged. Extrapolation (i.e., prediction is made outside
the training region) would be a typical example of covariate shift. Standard learning techniques
such as maximum likelihood estimation are biased under covariate shift; the bias
caused by covariate shift can be asymptotically canceled by weighting the loss function according
to the importance² (Shimodaira, 2000; Zadrozny, 2004; Sugiyama & Müller, 2005;
Sugiyama et al., 2007; Quiñonero-Candela et al., 2009; Sugiyama & Kawanabe, 2010).


² The test input density over the training input density is referred to as the importance in the context
of importance sampling (Fishman, 1996).
The basic idea of covariate shift adaptation is summarized in the following importance
sampling identity:

\[
\mathbb{E}_{p_{\mathrm{nu}}(x)}[\mathrm{loss}(x)]
= \int \mathrm{loss}(x)\, p_{\mathrm{nu}}(x)\, dx
= \int \mathrm{loss}(x)\, r(x)\, p_{\mathrm{de}}(x)\, dx
= \mathbb{E}_{p_{\mathrm{de}}(x)}[\mathrm{loss}(x)\, r(x)].
\]

That is, the expectation of a function loss(x) over $p_{\mathrm{nu}}(x)$ can be computed by the
importance-weighted expectation of loss(x) over $p_{\mathrm{de}}(x)$. Similarly, standard model selection
criteria such as cross-validation (Stone, 1974; Wahba, 1990) or Akaike's information
criterion (Akaike, 1974) lose their unbiasedness under covariate shift; proper unbiasedness
can be recovered by modifying the methods based on importance weighting (Shimodaira,
2000; Zadrozny, 2004; Sugiyama & Müller, 2005; Sugiyama et al., 2007). Furthermore,
the performance of active learning or experimental design, in which the training input distribution
is designed by the user to enhance the generalization performance, could also
be improved by the use of the importance (Wiens, 2000; Kanamori & Shimodaira, 2003;
Sugiyama, 2006; Sugiyama & Nakajima, 2009).
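
The importance sampling identity can be checked numerically. The following is a minimal sketch, not part of the original report: it assumes that the denominator density is N(0, 1), that the numerator density is N(0.5, 1), and that the loss is x^2, so that the importance is known in closed form. The function names are hypothetical.

```python
import math
import random


def importance(x, shift=0.5):
    # Assumed setup: r(x) = N(x; shift, 1) / N(x; 0, 1) = exp(shift * x - shift^2 / 2).
    return math.exp(shift * x - 0.5 * shift ** 2)


def loss(x):
    # Illustrative loss function (an assumption, chosen for its known expectation).
    return x * x


def weighted_expectation(n=200_000, shift=0.5, seed=0):
    # Average loss(x) * r(x) over samples drawn from the denominator density N(0, 1).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        total += loss(x) * importance(x, shift)
    return total / n


estimate = weighted_expectation()
exact = 1.0 + 0.5 ** 2  # E[x^2] under N(0.5, 1) is variance + mean^2
```

Here `estimate` uses only samples from the denominator density, yet it approximates the expectation under the numerator density, which is the mechanism behind importance-weighted learning under covariate shift.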

Thus, the importance plays a central role in covariate shift adaptation, and density-ratio
estimation methods could be used for reducing the estimation bias under covariate
shift. Examples of successful real-world applications include brain-computer interface
(Sugiyama et al., 2007), robot control (Hachiya et al., 2009a; Akiyama et al., 2010;
Hachiya et al., 2009b), speaker identification (Yamada et al., 2010), age prediction from
face images (Ueki et al., 2010), wafer alignment in semiconductor exposure apparatus
(Sugiyama & Nakajima, 2009), and natural language processing (Tsuboi et al., 2009). A
similar importance-weighting idea also plays a central role in domain adaptation (Storkey
& Sugiyama, 2007) and multi-task learning (Bickel et al., 2008).

A.2 Inlier-based Outlier Detection

Let us consider an outlier detection problem of finding irregular samples in a dataset
("evaluation dataset") based on another dataset ("model dataset") that only contains
regular samples. Defining the density ratio over the two sets of samples, we can see
that the density ratio values for regular samples are close to one, while those for outliers
tend to deviate significantly from one. Thus, the density ratio value could be used
as an index of the degree of outlyingness (Hido et al., 2008; Smola et al., 2009; Hido
et al., 2010). Since the evaluation dataset usually has a wider support than the model
dataset, we regard the evaluation dataset as samples corresponding to $p_{\mathrm{de}}(x)$ and the
model dataset as samples corresponding to $p_{\mathrm{nu}}(x)$. Then outliers tend to have smaller
density-ratio values (i.e., close to zero). As such, density-ratio estimation methods could
be employed in outlier detection scenarios.
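
The behavior of the ratio as an inlier score can be illustrated with a toy example. This is a sketch under strong assumptions that are not in the original text: both densities are known in closed form (the model density is N(0, 1), and the evaluation density is a mixture with 5% of its mass around x = 6), whereas in practice the ratio would be estimated from samples by a density-ratio estimator.

```python
import math


def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def p_nu(x):
    # Model dataset density (regular samples only); assumed to be N(0, 1).
    return normal_pdf(x, 0.0)


def p_de(x, outlier_rate=0.05):
    # Evaluation dataset density; assumed mixture of regular samples and outliers near 6.
    return (1.0 - outlier_rate) * normal_pdf(x, 0.0) + outlier_rate * normal_pdf(x, 6.0)


def inlier_score(x):
    # Density ratio p_nu / p_de: close to one for regular points, close to zero for outliers.
    return p_nu(x) / p_de(x)


score_regular = inlier_score(0.0)
score_outlier = inlier_score(6.0)
```

Ranking evaluation points by `inlier_score` in ascending order puts the most outlying points first, which is how an AUC score such as those in Tables 1 and 2 can be computed once labels are available.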

A  similar  idea  could  be  used  for  change-point  detection  in  time-series  (Kawahara  & 
Sugiyama,  2009). 
A.3 Conditional Density Estimation

Suppose we are given n i.i.d. paired samples $\{(x_k, y_k)\}_{k=1}^{n}$ drawn from a joint distribution
with density $q(x, y)$. The goal is to estimate the conditional density $q(y|x)$. When the
domain of x is continuous, conditional density estimation is not straightforward since a
naive empirical approximation cannot be used (Bishop, 2006; Takeuchi et al., 2009).

In the context of density ratio estimation, let us regard $\{(x_k, y_k)\}_{k=1}^{n}$ as samples
corresponding to the numerator of the density ratio and $\{x_k\}_{k=1}^{n}$ as samples corresponding
to the denominator of the density ratio, i.e., we consider the density ratio defined by

\[
r(x, y) := \frac{q(x, y)}{q(x)} = q(y|x),
\]

where $q(x)$ is the marginal density of x. Then a density-ratio estimation method directly
gives an estimate of the conditional density (Sugiyama et al., 2010c).

When y is categorical, the same method can be used for probabilistic classification
(Sugiyama, 2010).
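
As a sanity check of the identity r(x, y) = q(x, y) / q(x) = q(y|x), the sketch below uses a hypothetical example that is not from the report: a zero-mean bivariate normal with covariance [[1, 1], [1, 2]], for which the conditional distribution of y given x is known to be N(x, 1).

```python
import math


def joint_density(x, y):
    # Zero-mean bivariate normal with covariance [[1, 1], [1, 2]] (assumed example).
    # Its determinant is 1 and its inverse is [[2, -1], [-1, 1]].
    quad = 2.0 * x * x - 2.0 * x * y + y * y
    return math.exp(-0.5 * quad) / (2.0 * math.pi)


def marginal_density(x):
    # Marginal of x is N(0, 1).
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def conditional_via_ratio(x, y):
    # Density ratio with the joint in the numerator and the marginal in the denominator.
    return joint_density(x, y) / marginal_density(x)


def conditional_closed_form(x, y):
    # For this covariance, y given x is N(x, 1).
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2.0 * math.pi)
```

The two functions agree pointwise, so an estimator of this particular ratio is an estimator of the conditional density.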

A.4 Mutual Information Estimation

Suppose we are given n i.i.d. paired samples $\{(x_k, y_k)\}_{k=1}^{n}$ drawn from a joint distribution
with density $q(x, y)$. Let us denote the marginal densities of x and y by $q(x)$ and $q(y)$,
respectively. Then mutual information MI(X, Y) between random variables X and Y is
defined by

\[
\mathrm{MI}(X, Y) := \iint q(x, y) \log \frac{q(x, y)}{q(x)\, q(y)}\, dx\, dy,
\]

which plays a central role in information theory (Cover & Thomas, 2006).

Let us regard $\{(x_k, y_k)\}_{k=1}^{n}$ as samples corresponding to the numerator of the density
ratio and $\{(x_k, y_{k'})\}_{k,k'=1}^{n}$ as samples corresponding to the denominator of the density
ratio, i.e.,

\[
r(x, y) := \frac{q(x, y)}{q(x)\, q(y)}.
\]

Then mutual information can be directly estimated using a density-ratio estimation
method (Suzuki et al., 2008; Suzuki et al., 2009b). General divergence functionals can
also be estimated in a similar way (Nguyen et al., 2010).

Mutual information can be used for measuring independence between random variables
(Kraskov et al., 2004; Hulle, 2005) since it vanishes if and only if X and Y are
statistically independent. Thus density-ratio estimation methods are applicable, e.g.,
to variable selection (Suzuki et al., 2009a), independent component analysis (Suzuki &
Sugiyama, 2009), supervised dimensionality reduction (Suzuki & Sugiyama, 2010), and
causal inference (Yamada & Sugiyama, 2010).
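
The link between mutual information and the ratio q(x, y) / (q(x) q(y)) can be seen on a small discrete analogue. The sketch below is an illustration only: it assumes a finite joint probability table in place of the continuous densities in the text, and the tables are made up.

```python
import math


def mutual_information(joint):
    # joint[i][j] is the probability of (x = i, y = j); the marginals follow by summation.
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, q in enumerate(row):
            if q > 0.0:
                ratio = q / (px[i] * py[j])  # the density ratio r(x, y)
                mi += q * math.log(ratio)    # expectation of log r under the joint
    return mi


dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

Mutual information is the expectation of log r(x, y) under the joint distribution, so it is zero exactly when the ratio is identically one, which is the independent case.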
B Proof of Lemma 1


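Here, let us consider the f-divergences (Ali & Silvey, 1966; Csiszár, 1967) and prove a
similar inequality for a broader class of divergences. An f-divergence is defined using a
convex function f such that f(1) = 0 as

\[
I_f[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)] := \int p_{\mathrm{de}}(x)\, f\!\left(\frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}\right) dx.
\]

The f-divergence is reduced to the Kullback-Leibler divergence if

\[
f(t) = -\log t,
\]

and the Pearson divergence if

\[
f(t) = \frac{1}{2}(t - 1)^2.
\]

Using Jensen's inequality (Bishop, 2006), we have

\[
\begin{aligned}
I_f[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)]
&= \iint p_{\mathrm{de}}(v|u)\, p_{\mathrm{de}}(u)\, f\!\left(\frac{p_{\mathrm{nu}}(v|u)\, p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(v|u)\, p_{\mathrm{de}}(u)}\right) du\, dv \\
&\ge \int p_{\mathrm{de}}(u)\, f\!\left(\int p_{\mathrm{de}}(v|u)\, \frac{p_{\mathrm{nu}}(v|u)\, p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(v|u)\, p_{\mathrm{de}}(u)}\, dv\right) du \\
&= \int p_{\mathrm{de}}(u)\, f\!\left(\frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)} \int p_{\mathrm{nu}}(v|u)\, dv\right) du \\
&= \int p_{\mathrm{de}}(u)\, f\!\left(\frac{p_{\mathrm{nu}}(u)}{p_{\mathrm{de}}(u)}\right) du \\
&= I_f[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)].
\end{aligned}
\]

Thus, we have

\[
I_f[p_{\mathrm{nu}}(x), p_{\mathrm{de}}(x)] - I_f[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)] \ge 0,
\]

and the equality holds if and only if $p_{\mathrm{nu}}(v|u) = p_{\mathrm{de}}(v|u)$.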
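
The inequality can be checked on a discrete analogue. The following sketch is an illustration under assumptions that are not in the report: the pair (u, v) takes finitely many values, the joint tables are made up, and the Pearson choice f(t) = (t - 1)^2 / 2 is used.

```python
def pearson_divergence(p_nu, p_de):
    # Discrete f-divergence with f(t) = (t - 1)^2 / 2: sum of p_de * f(p_nu / p_de).
    return sum(d * 0.5 * (n / d - 1.0) ** 2 for n, d in zip(p_nu, p_de))


def marginal_over_v(joint):
    # Rows index u and columns index v; summing each row marginalizes v out.
    return [sum(row) for row in joint]


def flatten(joint):
    return [p for row in joint for p in row]


joint_nu = [[0.30, 0.10], [0.20, 0.40]]
joint_de = [[0.20, 0.30], [0.25, 0.25]]

full = pearson_divergence(flatten(joint_nu), flatten(joint_de))
reduced = pearson_divergence(marginal_over_v(joint_nu), marginal_over_v(joint_de))
```

Here `full` is the divergence between the joint distributions of (u, v) and `reduced` is the divergence after discarding v; the gap between them is what is lost when the conditional distributions of v given u differ.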


C  Proof  of  Lemma  2 

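For

\[
F = (\widehat{H} + \lambda I_b)^{-1},
\]

$\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)]$ can be expressed as

\[
\widehat{\mathrm{PD}}[p_{\mathrm{nu}}(u), p_{\mathrm{de}}(u)] = \frac{1}{2} \sum_{\ell,\ell'=1}^{b} \widehat{h}_{\ell}\, \widehat{h}_{\ell'}\, F_{\ell,\ell'} - \frac{1}{2}.
\]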
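Thus, its partial derivative with respect to U is given by

\[
\frac{\partial \widehat{\mathrm{PD}}}{\partial U}
= \sum_{\ell=1}^{b} \widehat{\alpha}_{\ell} \frac{\partial \widehat{h}_{\ell}}{\partial U}
+ \frac{1}{2} \sum_{\ell,\ell'=1}^{b} \widehat{h}_{\ell}\, \widehat{h}_{\ell'} \frac{\partial F_{\ell,\ell'}}{\partial U}.
\tag{16}
\]

Since

\[
\frac{\partial B^{-1}}{\partial t} = -B^{-1} \frac{\partial B}{\partial t} B^{-1}
\]

holds for a square invertible matrix B (Petersen & Pedersen, 2008), it holds that

\[
\frac{\partial F}{\partial U_{k,k'}} = -(\widehat{H} + \lambda I_b)^{-1} \frac{\partial \widehat{H}}{\partial U_{k,k'}} (\widehat{H} + \lambda I_b)^{-1}.
\]

Then we have

\[
\sum_{\ell,\ell'=1}^{b} \widehat{h}_{\ell}\, \widehat{h}_{\ell'} \left[\frac{\partial F}{\partial U_{k,k'}}\right]_{\ell,\ell'}
= -\widehat{h}^{\top} (\widehat{H} + \lambda I_b)^{-1} \frac{\partial \widehat{H}}{\partial U_{k,k'}} (\widehat{H} + \lambda I_b)^{-1} \widehat{h}.
\]

Substituting this into Eq.(16), we obtain Eq.(8). Eqs.(9) and (10) are clear from Eqs.(3)
and (2). Finally, we prove Eq.(11). The basis function $\psi_{\ell}(u)$ can be expressed as

\[
\psi_{\ell}(u) = \psi_{\ell}(Ux) = \exp\!\left(-\frac{\|U(x - c'_{\ell})\|^2}{2\sigma^2}\right).
\]

Since $\partial \|Aa\|^2 / \partial A = 2Aaa^{\top}$ (Petersen & Pedersen, 2008), we have

\[
\frac{\partial \psi_{\ell}(u)}{\partial U}
= -\frac{1}{\sigma^2}\, U(x - c'_{\ell})(x - c'_{\ell})^{\top} \exp\!\left(-\frac{\|U(x - c'_{\ell})\|^2}{2\sigma^2}\right),
\]

from which we obtain Eq.(11).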
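
The matrix-inverse derivative identity used in this proof can be verified by finite differences. The sketch below is an illustration only: the 2 x 2 matrix function `b_of_t` is an arbitrary assumption chosen so that everything can be written without external libraries.

```python
def inverse_2x2(b):
    (a, c), (d, e) = b
    det = a * e - c * d
    return [[e / det, -c / det], [-d / det, a / det]]


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def b_of_t(t):
    # Hypothetical smooth, invertible matrix-valued function of a scalar t.
    return [[2.0 + t, 0.5 * t], [t * t, 3.0 - t]]


def db_dt(t):
    # Entry-wise derivative of b_of_t.
    return [[1.0, 0.5], [2.0 * t, -1.0]]


def analytic_derivative(t):
    # d(B^{-1})/dt = -B^{-1} (dB/dt) B^{-1}
    b_inv = inverse_2x2(b_of_t(t))
    prod = matmul(matmul(b_inv, db_dt(t)), b_inv)
    return [[-v for v in row] for row in prod]


def finite_difference(t, eps=1e-6):
    # Central difference approximation of d(B^{-1})/dt.
    plus = inverse_2x2(b_of_t(t + eps))
    minus = inverse_2x2(b_of_t(t - eps))
    return [[(plus[i][j] - minus[i][j]) / (2.0 * eps) for j in range(2)] for i in range(2)]
```

The same check, applied entry by entry with t replaced by an element of U, is a convenient way to test a hand-derived gradient before using it in a subspace search.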

D  Proof  of  Lemma  3 

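The proof we provide here essentially follows the argument in Plumbley (2005).
For

\[
W = \begin{pmatrix} U \\ V \end{pmatrix}, \qquad W_0 = \begin{pmatrix} U_0 \\ V_0 \end{pmatrix},
\]

a rotation $W$ from some $W_0$ can be expressed as follows (Plumbley, 2005):

\[
W = \exp(M)\, W_0, \tag{17}
\]

where M is some skew-symmetric matrix. Let us consider the space of skew-symmetric
matrices, and let E be an element in that space with unit length. Then the gradient of
a function $\widehat{\mathrm{PD}}(M)$ with respect to M, $\partial \widehat{\mathrm{PD}} / \partial M$, in this space is given as an element whose
inner product $\langle \partial \widehat{\mathrm{PD}} / \partial M, E \rangle$ is equal to the derivative of $\widehat{\mathrm{PD}}(M)$ in the direction E (Plumbley,
2005). Thus, for M = tE with t being a scalar, we have

\[
\frac{\partial \widehat{\mathrm{PD}}(tE)}{\partial t} = \left\langle \frac{\partial \widehat{\mathrm{PD}}(M)}{\partial M}, E \right\rangle.
\]

If we adopt Eq.(12) as the inner product of the space of skew-symmetric matrices, we
have

\[
\frac{\partial \widehat{\mathrm{PD}}(tE)}{\partial t} = \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}(M)}{\partial M} E^{\top}\right). \tag{18}
\]

On the other hand, from Eq.(17) with M = tE, $\partial W / \partial t$ can be expressed as follows
(Petersen & Pedersen, 2008):

\[
\frac{\partial W}{\partial t} = E \exp(tE)\, W_0 = E W.
\]

Then $\partial \widehat{\mathrm{PD}}(tE) / \partial t$ can be expressed as

\[
\frac{\partial \widehat{\mathrm{PD}}(tE)}{\partial t}
= \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} \left(\frac{\partial W}{\partial t}\right)^{\!\top}\right)
= \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} E^{\top}\right). \tag{19}
\]

Since E is skew-symmetric, it can be expressed as

\[
E = \frac{1}{2} E - \frac{1}{2} E^{\top}.
\]

Substituting this into Eq.(19), we have

\[
\begin{aligned}
\frac{\partial \widehat{\mathrm{PD}}(tE)}{\partial t}
&= \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} E^{\top}\right)
 - \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} E\right) \\
&= \frac{1}{2} \mathrm{tr}\!\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} E^{\top}\right)
 - \frac{1}{2} \mathrm{tr}\!\left(W \frac{\partial \widehat{\mathrm{PD}}}{\partial W}^{\!\top} E^{\top}\right) \\
&= \frac{1}{2} \mathrm{tr}\!\left(\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} - W \frac{\partial \widehat{\mathrm{PD}}}{\partial W}^{\!\top}\right) E^{\top}\right). 
\end{aligned}
\tag{20}
\]

Combining Eqs.(18) and (20), we have

\[
\frac{\partial \widehat{\mathrm{PD}}}{\partial M}
= \frac{\partial \widehat{\mathrm{PD}}}{\partial W} W^{\top} - W \frac{\partial \widehat{\mathrm{PD}}}{\partial W}^{\!\top}
= \begin{pmatrix}
\frac{\partial \widehat{\mathrm{PD}}}{\partial U} U^{\top} - U \frac{\partial \widehat{\mathrm{PD}}}{\partial U}^{\!\top} &
\frac{\partial \widehat{\mathrm{PD}}}{\partial U} V^{\top} - U \frac{\partial \widehat{\mathrm{PD}}}{\partial V}^{\!\top} \\
\frac{\partial \widehat{\mathrm{PD}}}{\partial V} U^{\top} - V \frac{\partial \widehat{\mathrm{PD}}}{\partial U}^{\!\top} &
\frac{\partial \widehat{\mathrm{PD}}}{\partial V} V^{\top} - V \frac{\partial \widehat{\mathrm{PD}}}{\partial V}^{\!\top}
\end{pmatrix}. \tag{21}
\]

Eq.(11) implies that $\frac{\partial \psi_{\ell}}{\partial U} U^{\top}$ is symmetric. Then Eqs.(8) and (9) imply that
$\frac{\partial \widehat{h}_{\ell}}{\partial U} U^{\top}$ and $\frac{\partial \widehat{H}_{\ell,\ell'}}{\partial U} U^{\top}$ are also symmetric. Consequently, Eq.(10) implies that
$\frac{\partial \widehat{\mathrm{PD}}}{\partial U} U^{\top}$ is symmetric:

\[
\frac{\partial \widehat{\mathrm{PD}}}{\partial U} U^{\top}
= \left(\frac{\partial \widehat{\mathrm{PD}}}{\partial U} U^{\top}\right)^{\!\top}
= U \frac{\partial \widehat{\mathrm{PD}}}{\partial U}^{\!\top}.
\]

Since the range of V is assumed to be orthogonal to the range of U (see Section 3.1), $\widehat{\mathrm{PD}}$
is independent of V, and thus we have

\[
\frac{\partial \widehat{\mathrm{PD}}}{\partial V} = O_{(d-m),d},
\]

where $O_{d,d'}$ is the $d \times d'$ matrix with all zeros. Then Eq.(21) yields

\[
\frac{\partial \widehat{\mathrm{PD}}}{\partial M}
= \begin{pmatrix}
O_{m,m} & \frac{\partial \widehat{\mathrm{PD}}}{\partial U} V^{\top} \\
-\left(\frac{\partial \widehat{\mathrm{PD}}}{\partial U} V^{\top}\right)^{\!\top} & O_{(d-m),(d-m)}
\end{pmatrix},
\]

which concludes the proof.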
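
Two facts used in this proof, that exp(M) W_0 is a rotation when M is skew-symmetric and that the derivative of W with respect to t equals EW for M = tE, can be checked numerically. The sketch below is illustrative and rests on assumptions: it works with a 2 x 2 case, takes W_0 to be the identity, and uses a truncated Taylor series as the matrix exponential.

```python
import math


def matmul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def expm_series(m, terms=30):
    # Truncated Taylor series of the matrix exponential; adequate for small, well-scaled matrices.
    n = len(m)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


t = 0.7
E = [[0.0, -1.0], [1.0, 0.0]]              # skew-symmetric direction: E^T = -E
M = [[t * v for v in row] for row in E]
W0 = identity(2)                            # assumed starting point
W = matmul(expm_series(M), W0)              # Eq.(17)
WtW = matmul(transpose(W), W)               # should be the identity for a rotation
EW = matmul(E, W)                           # claimed value of dW/dt
```

In this 2 x 2 case `W` is the planar rotation by angle `t`, which makes the orthogonality constraint on the subspace automatic: a gradient step is taken in M and mapped back through the exponential.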
Abstract

Density  ratio  estimation  has  gathered  a  great  deal  of  attention  recently  since  it  can 
be  used  for  various  data  processing  tasks.  In  this  paper,  we  consider  three  meth¬ 
ods  of  density  ratio  estimation:  (A)  the  numerator  and  denominator  densities  are 
separately  estimated  and  then  the  ratio  of  the  estimated  densities  is  computed,  (B) 
a  logistic  regression  classifier  discriminating  denominator  samples  from  numerator 
samples  is  learned  and  then  the  ratio  of  the  posterior  probabilities  is  computed, 
and  (C)  the  density  ratio  function  is  directly  modeled  and  learned  by  minimizing 
the  empirical  Kullback-Leibler  divergence.  We  first  prove  that  when  the  numerator 
and  denominator  densities  are  known  to  be  members  of  the  exponential  family,  (A) 
is  better  than  (B)  and  (B)  is  better  than  (C).  Then  we  show  that  once  the  model 
assumption  is  violated,  (C)  is  better  than  (A)  and  (B).  Thus  in  practical  situations 
where  no  exact  model  is  available,  (C)  would  be  the  most  promising  approach  to 
density  ratio  estimation. 


Keywords 

density  ratio  estimation,  density  estimation,  logistic  regression,  asymptotic  analysis, 
Gaussian  assumption. 
1 Introduction

The  ratio  of  two  probability  density  functions  has  been  demonstrated  to  be  useful  in 
various  data  processing  tasks  [21],  such  as  non-stationarity  adaptation  [18,  35,  23,  22, 
17,  27],  outlier  detection  [7,  19],  conditional  density  estimation  [26],  feature  selection 
[31,  30],  feature  extraction  [29],  and  independent  component  analysis  [28].  Thus  accurately 
estimating  the  density  ratio  is  an  important  and  challenging  research  topic  in  the  machine 
learning  and  data  mining  communities. 

A  naive  approach  to  density  ratio  estimation  is  (A)  density  ratio  estimation  by  separate 
maximum  likelihood  density  estimation — first  the  numerator  and  denominator  densities 
are  separately  estimated  and  then  the  ratio  of  the  estimated  densities  is  computed.  How¬ 
ever,  density  estimation  is  substantially  more  difficult  than  density  ratio  estimation  and 
the  above  two-shot  process  of  first  estimating  the  densities  and  then  taking  their  ratio 
is  thought  to  be  less  accurate.  To  cope  with  this  problem,  various  alternative  methods 
have  been  developed  recently,  which  allow  one  to  estimate  the  density  ratio  without  going 
through  density  estimation  [16,  8,  15,  25,  10]. 

In  this  paper,  we  consider  the  following  two  methods  in  addition  to  the  method  (A): 
(B)  density  ratio  estimation  by  logistic  regression  [16,  3,  1] — a  logistic  regression  classifier 
discriminating  numerator  samples  from  denominator  samples  is  used  for  density  ratio  esti¬ 
mation,  and  (C)  direct  density  ratio  estimation  by  empirical  Kullback-Leibler  divergence 
minimization  [15,  25] — the  density  ratio  function  is  directly  modeled  and  learned.  The 
goal  of  this  paper  is  to  theoretically  compare  the  accuracy  of  these  three  density  ratio 
estimation  schemes. 

We  first  prove  that  when  the  numerator  and  denominator  densities  are  known  to  be 
members  of  the  exponential  family,  (A)  is  better  than  (B)  and  (B)  is  better  than  (C).  The 
fact  that  (A)  is  better  than  (B)  could  be  regarded  as  an  extension  of  the  existing  result 
for  binary  classification  [5] — estimating  data  generating  densities  by  maximum  likelihood 
estimation  has  higher  statistical  efficiency  than  logistic  regression  in  classification  scenar¬ 
ios.  On  the  other  hand,  the  fact  that  (B)  is  better  than  (C)  follows  from  the  fact  that 
(B)  has  the  smallest  asymptotic  variance  in  a  class  of  semi-parametric  estimators  [16]. 

We  then  show  that  when  the  model  assumption  is  violated,  (C)  is  better  than  (A) 
and  (B).  Our  statement  is  that  the  estimator  obtained  by  (C)  converges  to  the  projection 
of  the  true  density  ratio  function  onto  the  target  parametric  model  (i.e.,  the  optimal  ap¬ 
proximation  in  the  model),  while  the  estimators  obtained  by  (A)  and  (B)  do  not  generally 
converge  to  the  projection. 

Since model misspecification would be a usual situation in practice, (C) is the most
promising approach in density ratio estimation. In a regression framework, an asymptotic
analysis with a similar spirit exists [14].

2  Density  Ratio  Estimation 

In  this  section,  we  formulate  the  problem  of  density  ratio  estimation  and  review  three 
density  ratio  estimators. 
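2.1 Problem Formulation

Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the data domain and suppose we are given independent and identically
distributed (i.i.d.) samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ drawn from a distribution with density $p^*_{\mathrm{nu}}(x)$ and
i.i.d. samples $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$ drawn from another distribution with density $p^*_{\mathrm{de}}(x)$:

\[
\{x_i^{\mathrm{nu}}\}_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{nu}}(x),
\qquad
\{x_j^{\mathrm{de}}\}_{j=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{de}}(x).
\]

The subscripts 'nu' and 'de' denote 'numerator' and 'denominator', respectively. We
assume that the latter density $p^*_{\mathrm{de}}(x)$ is strictly positive, i.e.,

\[
p^*_{\mathrm{de}}(x) > 0, \quad \forall x \in \mathcal{X}.
\]

The problem we address in this paper is to estimate the density ratio

\[
r^*(x) := \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)}
\]

from samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$.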

The  goal  of  this  paper  is  to  theoretically  compare  the  performance  of  the  following 
three  density  ratio  estimators: 

(A)  Density  ratio  estimation  by  separate  maximum  likelihood  density  estimation  (see 
Section  2.3  for  details), 

(B)  Density  ratio  estimation  by  logistic  regression  [16,  3,  1]  (see  Section  2.4  for  details), 

(C)  Direct  density  ratio  estimation  by  empirical  Kullback-Leibler  divergence  minimiza¬ 
tion  [15,  25]  (see  Section  2.5  for  details). 

2.2  Measure  of  Accuracy 

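Let us consider the unnormalized Kullback-Leibler divergence [2] from the true density
$p^*_{\mathrm{nu}}(x)$ to its estimator $\widehat{r}(x)\, p^*_{\mathrm{de}}(x)$:

\[
\mathrm{UKL}(p^*_{\mathrm{nu}} \,\|\, \widehat{r} \cdot p^*_{\mathrm{de}})
:= \int p^*_{\mathrm{nu}}(x) \log \frac{p^*_{\mathrm{nu}}(x)}{\widehat{r}(x)\, p^*_{\mathrm{de}}(x)}\, dx - 1 + \int \widehat{r}(x)\, p^*_{\mathrm{de}}(x)\, dx. \tag{1}
\]

$\mathrm{UKL}(p^*_{\mathrm{nu}} \,\|\, \widehat{r} \cdot p^*_{\mathrm{de}})$ is non-negative for all $\widehat{r}$ and vanishes if and only if $\widehat{r} = r^*$. If
$\widehat{r}(x)\, p^*_{\mathrm{de}}(x)$ is normalized to be a probability density function, i.e.,

\[
\int \widehat{r}(x)\, p^*_{\mathrm{de}}(x)\, dx = 1,
\]

then the unnormalized Kullback-Leibler divergence is reduced to the ordinary Kullback-Leibler
divergence [12]:

\[
\mathrm{KL}(p^*_{\mathrm{nu}} \,\|\, \widehat{r} \cdot p^*_{\mathrm{de}})
:= \int p^*_{\mathrm{nu}}(x) \log \frac{p^*_{\mathrm{nu}}(x)}{\widehat{r}(x)\, p^*_{\mathrm{de}}(x)}\, dx. \tag{2}
\]

In our theoretical analysis, we use the expectation of $\mathrm{UKL}(p^*_{\mathrm{nu}} \,\|\, \widehat{r} \cdot p^*_{\mathrm{de}})$ over $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and
$\{x_j^{\mathrm{de}}\}_{j=1}^{n}$ as the measure of accuracy of a density ratio estimator $\widehat{r}(x)$:

\[
J(\widehat{r}) := \mathbb{E}\left[\mathrm{UKL}(p^*_{\mathrm{nu}} \,\|\, \widehat{r} \cdot p^*_{\mathrm{de}})\right], \tag{3}
\]

where $\mathbb{E}$ denotes the expectation over $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$.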

In  the  rest  of  this  section,  the  three  methods  of  density  ratio  estimation  we  are  dealing 
with  are  described  in  detail. 
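
The properties of the unnormalized Kullback-Leibler divergence in Eq.(1) can be illustrated on a discrete analogue. The sketch below is not from the paper: it assumes probability vectors in place of densities, and the numbers are made up.

```python
import math


def ukl(p_nu, p_de, r):
    # Discrete analogue of Eq.(1): log-ratio term, minus one, plus the total mass of r * p_de.
    log_term = sum(pn * math.log(pn / (ri * pd)) for pn, pd, ri in zip(p_nu, p_de, r))
    mass = sum(ri * pd for pd, ri in zip(p_de, r))
    return log_term - 1.0 + mass


p_nu = [0.2, 0.5, 0.3]
p_de = [0.4, 0.4, 0.2]

r_true = [pn / pd for pn, pd in zip(p_nu, p_de)]   # the true ratio
r_scaled = [1.5 * ri for ri in r_true]             # correct shape, wrong normalization

ukl_true = ukl(p_nu, p_de, r_true)
ukl_scaled = ukl(p_nu, p_de, r_scaled)
```

The second case shows why the unnormalized form is used: a candidate with the right shape but the wrong scale is still penalized, by 1.5 - 1 - log 1.5 here, so the criterion also pins down the normalization of the ratio.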


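2.3 Method (A): Density Ratio Estimation by Separate Maximum
Likelihood Density Estimation

For $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$, two parametric models $p_{\mathrm{nu}}(x; \theta_{\mathrm{nu}})$ and $p_{\mathrm{de}}(x; \theta_{\mathrm{de}})$ such that

\[
\begin{aligned}
&\int p_{\mathrm{nu}}(x; \theta_{\mathrm{nu}})\, dx = 1, \quad \forall \theta_{\mathrm{nu}} \in \Theta_{\mathrm{nu}}, \\
&p_{\mathrm{nu}}(x; \theta_{\mathrm{nu}}) > 0, \quad \forall x \in \mathcal{X},\ \forall \theta_{\mathrm{nu}} \in \Theta_{\mathrm{nu}}, \\
&\int p_{\mathrm{de}}(x; \theta_{\mathrm{de}})\, dx = 1, \quad \forall \theta_{\mathrm{de}} \in \Theta_{\mathrm{de}}, \\
&p_{\mathrm{de}}(x; \theta_{\mathrm{de}}) > 0, \quad \forall x \in \mathcal{X},\ \forall \theta_{\mathrm{de}} \in \Theta_{\mathrm{de}},
\end{aligned}
\]

are prepared. Then the maximum likelihood estimators $\widehat{\theta}_{\mathrm{nu}}$ and $\widehat{\theta}_{\mathrm{de}}$ are computed separately
from $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$:

\[
\widehat{\theta}_{\mathrm{nu}} := \operatorname*{argmax}_{\theta_{\mathrm{nu}} \in \Theta_{\mathrm{nu}}} \left[\sum_{i=1}^{n} \log p_{\mathrm{nu}}(x_i^{\mathrm{nu}}; \theta_{\mathrm{nu}})\right],
\qquad
\widehat{\theta}_{\mathrm{de}} := \operatorname*{argmax}_{\theta_{\mathrm{de}} \in \Theta_{\mathrm{de}}} \left[\sum_{j=1}^{n} \log p_{\mathrm{de}}(x_j^{\mathrm{de}}; \theta_{\mathrm{de}})\right].
\]

Note that the maximum likelihood estimators $\widehat{\theta}_{\mathrm{nu}}$ and $\widehat{\theta}_{\mathrm{de}}$ minimize the empirical
Kullback-Leibler divergences from the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ to their models
$p_{\mathrm{nu}}(x; \theta_{\mathrm{nu}})$ and $p_{\mathrm{de}}(x; \theta_{\mathrm{de}})$, respectively:

\[
\widehat{\theta}_{\mathrm{nu}} = \operatorname*{argmin}_{\theta_{\mathrm{nu}} \in \Theta_{\mathrm{nu}}} \left[\frac{1}{n} \sum_{i=1}^{n} \log \frac{p^*_{\mathrm{nu}}(x_i^{\mathrm{nu}})}{p_{\mathrm{nu}}(x_i^{\mathrm{nu}}; \theta_{\mathrm{nu}})}\right],
\qquad
\widehat{\theta}_{\mathrm{de}} = \operatorname*{argmin}_{\theta_{\mathrm{de}} \in \Theta_{\mathrm{de}}} \left[\frac{1}{n} \sum_{j=1}^{n} \log \frac{p^*_{\mathrm{de}}(x_j^{\mathrm{de}})}{p_{\mathrm{de}}(x_j^{\mathrm{de}}; \theta_{\mathrm{de}})}\right].
\]

Finally, a density ratio estimator is constructed by taking the ratio of the estimated
densities:

\[
\widehat{r}_{\mathrm{A}}(x) := \frac{p_{\mathrm{nu}}(x; \widehat{\theta}_{\mathrm{nu}})}{p_{\mathrm{de}}(x; \widehat{\theta}_{\mathrm{de}})}
\left(\frac{1}{n} \sum_{j=1}^{n} \frac{p_{\mathrm{nu}}(x_j^{\mathrm{de}}; \widehat{\theta}_{\mathrm{nu}})}{p_{\mathrm{de}}(x_j^{\mathrm{de}}; \widehat{\theta}_{\mathrm{de}})}\right)^{-1},
\]

where the estimator is normalized so that

\[
\frac{1}{n} \sum_{j=1}^{n} \widehat{r}_{\mathrm{A}}(x_j^{\mathrm{de}}) = 1.
\]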
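
Method (A) can be sketched in a few lines for a one-dimensional Gaussian model. This is a minimal illustration under assumptions that are not in the paper: both models are univariate Gaussians with free mean and variance, and the data are synthetic draws from N(0.5, 1) and N(0, 1). The function names are hypothetical.

```python
import math
import random


def fit_gaussian(samples):
    # Maximum likelihood estimates of the mean and variance of a univariate Gaussian.
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def ratio_estimator_a(x_nu, x_de):
    # Fit the two densities separately, then take their ratio.
    mean_nu, var_nu = fit_gaussian(x_nu)
    mean_de, var_de = fit_gaussian(x_de)

    def raw_ratio(x):
        return gaussian_pdf(x, mean_nu, var_nu) / gaussian_pdf(x, mean_de, var_de)

    # Normalize so that the estimator averages to one over the denominator samples.
    norm = sum(raw_ratio(x) for x in x_de) / len(x_de)
    return lambda x: raw_ratio(x) / norm


rng = random.Random(0)
x_nu = [rng.gauss(0.5, 1.0) for _ in range(500)]   # synthetic numerator samples
x_de = [rng.gauss(0.0, 1.0) for _ in range(500)]   # synthetic denominator samples
r_hat_a = ratio_estimator_a(x_nu, x_de)
mean_over_de = sum(r_hat_a(x) for x in x_de) / len(x_de)
```

The two-step structure is visible in the code: any error in either fitted density propagates into the ratio, which is the concern that motivates the direct methods.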

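2.4 Method (B): Density Ratio Estimation by Logistic Regression

Let us assign a selector variable y = 'nu' to samples drawn from $p^*_{\mathrm{nu}}(x)$ and y = 'de' to
samples drawn from $p^*_{\mathrm{de}}(x)$, i.e., the two densities are written as

\[
p^*_{\mathrm{nu}}(x) = q^*(x \,|\, y = \text{`nu'}),
\qquad
p^*_{\mathrm{de}}(x) = q^*(x \,|\, y = \text{`de'}).
\]

Since

\[
q^*(x \,|\, y = \text{`nu'}) = \frac{q^*(y = \text{`nu'} \,|\, x)\, q^*(x)}{q^*(y = \text{`nu'})},
\qquad
q^*(x \,|\, y = \text{`de'}) = \frac{q^*(y = \text{`de'} \,|\, x)\, q^*(x)}{q^*(y = \text{`de'})},
\]

the density ratio can be expressed in terms of y as

\[
r^*(x) = \frac{q^*(y = \text{`nu'} \,|\, x)}{q^*(y = \text{`nu'})} \cdot \frac{q^*(y = \text{`de'})}{q^*(y = \text{`de'} \,|\, x)}
= \frac{q^*(y = \text{`nu'} \,|\, x)}{q^*(y = \text{`de'} \,|\, x)},
\]

where we used the fact that

\[
q^*(y = \text{`nu'}) = q^*(y = \text{`de'}) = \frac{1}{2}
\]

in the current setup.

The conditional probability $q^*(y|x)$ could be approximated by discriminating $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$
from $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$ using a logistic regression classifier, i.e., for a parametric function $r(x; \theta)$
such that

\[
r(x; \theta) > 0, \quad \forall x \in \mathcal{X},\ \forall \theta \in \Theta, \tag{4}
\]

the conditional probabilities $q^*(y = \text{`nu'} \,|\, x)$ and $q^*(y = \text{`de'} \,|\, x)$ are modeled by

\[
q(y = \text{`nu'} \,|\, x; \theta) = \frac{r(x; \theta)}{1 + r(x; \theta)},
\qquad
q(y = \text{`de'} \,|\, x; \theta) = \frac{1}{1 + r(x; \theta)}.
\]

Then the maximum likelihood estimator $\widehat{\theta}_{\mathrm{B}}$ is computed from $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$:

\[
\widehat{\theta}_{\mathrm{B}} := \operatorname*{argmax}_{\theta \in \Theta}
\left[\sum_{i=1}^{n} \log \frac{r(x_i^{\mathrm{nu}}; \theta)}{1 + r(x_i^{\mathrm{nu}}; \theta)}
+ \sum_{j=1}^{n} \log \frac{1}{1 + r(x_j^{\mathrm{de}}; \theta)}\right]. \tag{5}
\]

Note that the maximum likelihood estimator $\widehat{\theta}_{\mathrm{B}}$ minimizes the empirical Kullback-Leibler
divergence from the true density $q^*(x, y)$ to its estimator $q(y|x; \theta)\, q^*(x)$:

\[
\widehat{\theta}_{\mathrm{B}} = \operatorname*{argmin}_{\theta \in \Theta}
\left[\frac{1}{2n} \sum_{i=1}^{n} \log \frac{q^*(x_i^{\mathrm{nu}}, y = \text{`nu'})}{q(y = \text{`nu'} \,|\, x_i^{\mathrm{nu}}; \theta)\, q^*(x_i^{\mathrm{nu}})}
+ \frac{1}{2n} \sum_{j=1}^{n} \log \frac{q^*(x_j^{\mathrm{de}}, y = \text{`de'})}{q(y = \text{`de'} \,|\, x_j^{\mathrm{de}}; \theta)\, q^*(x_j^{\mathrm{de}})}\right].
\]

Finally, a density ratio estimator is constructed by taking the ratio of $q(y = \text{`nu'} \,|\, x; \widehat{\theta}_{\mathrm{B}})$
and $q(y = \text{`de'} \,|\, x; \widehat{\theta}_{\mathrm{B}})$ with proper normalization:

\[
\widehat{r}_{\mathrm{B}}(x) := \frac{q(y = \text{`nu'} \,|\, x; \widehat{\theta}_{\mathrm{B}})}{q(y = \text{`de'} \,|\, x; \widehat{\theta}_{\mathrm{B}})}
\left(\frac{1}{n} \sum_{j=1}^{n} \frac{q(y = \text{`nu'} \,|\, x_j^{\mathrm{de}}; \widehat{\theta}_{\mathrm{B}})}{q(y = \text{`de'} \,|\, x_j^{\mathrm{de}}; \widehat{\theta}_{\mathrm{B}})}\right)^{-1}
= r(x; \widehat{\theta}_{\mathrm{B}}) \left(\frac{1}{n} \sum_{j=1}^{n} r(x_j^{\mathrm{de}}; \widehat{\theta}_{\mathrm{B}})\right)^{-1}.
\]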
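
Method (B) can be sketched for a one-dimensional log-linear ratio model. The code below is an illustration under assumptions that are not in the paper: the model is r(x; theta, theta_0) = exp(theta_0 + theta x), the likelihood in Eq.(5) is maximized by plain gradient ascent with a fixed step size, and the data are synthetic draws from N(0.5, 1) and N(0, 1), for which the true slope is 0.5.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_logistic_ratio(x_nu, x_de, step=0.5, iters=300):
    # With r = exp(a), r / (1 + r) is sigmoid(a) and 1 / (1 + r) is 1 - sigmoid(a).
    theta0, theta = 0.0, 0.0
    n_total = len(x_nu) + len(x_de)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x in x_nu:
            resid = 1.0 - sigmoid(theta0 + theta * x)   # gradient of log sigmoid(a)
            g0 += resid
            g1 += resid * x
        for x in x_de:
            resid = -sigmoid(theta0 + theta * x)        # gradient of log(1 - sigmoid(a))
            g0 += resid
            g1 += resid * x
        theta0 += step * g0 / n_total
        theta += step * g1 / n_total
    return theta0, theta


rng = random.Random(1)
x_nu = [rng.gauss(0.5, 1.0) for _ in range(1000)]   # synthetic numerator samples
x_de = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic denominator samples
theta0_b, theta_b = fit_logistic_ratio(x_nu, x_de)

# Normalized estimator: the intercept is replaced by the empirical normalization.
norm_b = sum(math.exp(theta_b * x) for x in x_de) / len(x_de)
r_hat_b = lambda x: math.exp(theta_b * x) / norm_b
mean_over_de_b = sum(r_hat_b(x) for x in x_de) / len(x_de)
```

Only the classifier is fitted; neither density is estimated on its own, and the ratio is read off from the fitted odds.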


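2.5 Method (C): Direct Density Ratio Estimation by Empirical
Unnormalized Kullback-Leibler Divergence Minimization

For the density ratio function $r^*(x)$, a parametric model $r(x; \theta)$ such that Eq.(4) is fulfilled
is prepared. Then the following estimator $\widehat{\theta}_{\mathrm{C}}$ is computed from $\{x_i^{\mathrm{nu}}\}_{i=1}^{n}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n}$:

\[
\widehat{\theta}_{\mathrm{C}} := \operatorname*{argmax}_{\theta \in \Theta}
\left[\frac{1}{n} \sum_{i=1}^{n} \log r(x_i^{\mathrm{nu}}; \theta) - \frac{1}{n} \sum_{j=1}^{n} r(x_j^{\mathrm{de}}; \theta)\right]. \tag{6}
\]

Note that $\widehat{\theta}_{\mathrm{C}}$ minimizes the empirical unnormalized Kullback-Leibler divergence from the
true density $p^*_{\mathrm{nu}}(x)$ to its estimator $r(x; \theta)\, p^*_{\mathrm{de}}(x)$:

\[
\widehat{\theta}_{\mathrm{C}} = \operatorname*{argmin}_{\theta \in \Theta}
\left[\frac{1}{n} \sum_{i=1}^{n} \log \frac{p^*_{\mathrm{nu}}(x_i^{\mathrm{nu}})}{r(x_i^{\mathrm{nu}}; \theta)\, p^*_{\mathrm{de}}(x_i^{\mathrm{nu}})} - 1 + \frac{1}{n} \sum_{j=1}^{n} r(x_j^{\mathrm{de}}; \theta)\right].
\]

Finally, a density ratio estimator is obtained by

\[
\widehat{r}_{\mathrm{C}}(x) := r(x; \widehat{\theta}_{\mathrm{C}}).
\]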
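
Method (C) can be sketched in the same toy setting. The code below is an illustration under assumptions that are not in the paper: the model is r(x; theta, theta_0) = exp(theta_0 + theta x), the objective in Eq.(6) is maximized by plain gradient ascent with a fixed step size, and the data are synthetic draws from N(0.5, 1) and N(0, 1).

```python
import math
import random


def fit_kl_ratio(x_nu, x_de, step=0.1, iters=500):
    # Objective: mean of log r over numerator samples minus mean of r over denominator samples.
    theta0, theta = 0.0, 0.0
    mean_nu = sum(x_nu) / len(x_nu)
    n_de = len(x_de)
    for _ in range(iters):
        r_de = [math.exp(theta0 + theta * x) for x in x_de]
        g0 = 1.0 - sum(r_de) / n_de                                    # derivative in theta_0
        g1 = mean_nu - sum(r * x for r, x in zip(r_de, x_de)) / n_de   # derivative in theta
        theta0 += step * g0
        theta += step * g1
    return theta0, theta


rng = random.Random(2)
x_nu = [rng.gauss(0.5, 1.0) for _ in range(1000)]   # synthetic numerator samples
x_de = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic denominator samples
theta0_c, theta_c = fit_kl_ratio(x_nu, x_de)

r_hat_c = lambda x: math.exp(theta0_c + theta_c * x)
mean_over_de_c = sum(r_hat_c(x) for x in x_de) / len(x_de)
```

Setting the derivative with respect to the intercept to zero forces the fitted ratio to average to one over the denominator samples, so in this model no separate normalization step is needed.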


3  Exponential  Models 

In our theoretical analysis, we employ the exponential model, which is explained in this
section.

3.1 Exponential Models for Densities and Ratios

We use the following exponential model for the densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$.

\[
p(x; \theta) = h(x) \exp\left\{\theta^{\top} \xi(x) - \varphi(\theta)\right\}, \quad \theta \in \Theta, \tag{7}
\]

where $h(x)$ is a base measure, $\xi(x)$ is a sufficient statistic, $\varphi(\theta)$ is a normalization factor,
and $\top$ denotes the transpose of a vector [13]. The exponential model includes various
popular models as special cases, e.g., the normal, exponential, gamma, chi-square, and
beta distributions.

Correspondingly, we use the following exponential model for the ratio $r^*(x)$.

\[
r(x; \theta, \theta_0) = \exp\left\{\theta_0 + \theta^{\top} \xi(x)\right\}, \quad \theta \in \Theta,\ \theta_0 \in \mathbb{R}. \tag{8}
\]
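
The correspondence between models (7) and (8) can be checked on the simplest case. The sketch below assumes a unit-variance Gaussian location family, which is an example chosen here and not taken from the paper: h(x) is the standard normal density, the sufficient statistic is x, the parameter is the mean, and the normalization factor is half the squared mean.

```python
import math


def gaussian_density(x, theta):
    # Exponential family form (7) of N(theta, 1).
    h = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # base measure h(x)
    return h * math.exp(theta * x - 0.5 * theta * theta)     # xi(x) = x, phi(theta) = theta^2 / 2


def ratio_model(x, theta, theta0):
    # Exponential ratio model (8).
    return math.exp(theta0 + theta * x)


theta_nu, theta_de = 1.0, -0.5
theta = theta_nu - theta_de                                # difference of natural parameters
theta0 = -(0.5 * theta_nu ** 2 - 0.5 * theta_de ** 2)      # difference of normalization factors
```

The base measure cancels in the ratio, which is why the ratio model needs only the sufficient statistic and one extra scalar for the normalization.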

3.2  Method  (A) 

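For the exponential model (7), the maximum likelihood estimators $\widehat{\theta}_{\mathrm{nu}}$ and $\widehat{\theta}_{\mathrm{de}}$ are given
by

\[
\widehat{\theta}_{\mathrm{nu}} = \operatorname*{argmax}_{\theta_{\mathrm{nu}} \in \Theta}
\left[\sum_{i=1}^{n} \theta_{\mathrm{nu}}^{\top} \xi(x_i^{\mathrm{nu}}) - n \varphi(\theta_{\mathrm{nu}})\right],
\qquad
\widehat{\theta}_{\mathrm{de}} = \operatorname*{argmax}_{\theta_{\mathrm{de}} \in \Theta}
\left[\sum_{j=1}^{n} \theta_{\mathrm{de}}^{\top} \xi(x_j^{\mathrm{de}}) - n \varphi(\theta_{\mathrm{de}})\right],
\]

where irrelevant constants are ignored. The density ratio estimator $\widehat{r}_{\mathrm{A}}(x)$ for the exponential
density model is expressed as

\[
\widehat{r}_{\mathrm{A}}(x) = \exp\left\{\widehat{\theta}_{\mathrm{A}}^{\top} \xi(x)\right\}
\left(\frac{1}{n} \sum_{j=1}^{n} \exp\left\{\widehat{\theta}_{\mathrm{A}}^{\top} \xi(x_j^{\mathrm{de}})\right\}\right)^{-1},
\]

where

\[
\widehat{\theta}_{\mathrm{A}} := \widehat{\theta}_{\mathrm{nu}} - \widehat{\theta}_{\mathrm{de}}.
\]

One can use another estimator such as

\[
\widetilde{r}_{\mathrm{A}}(x) = \exp\left\{\widehat{\theta}_{\mathrm{A}}^{\top} \xi(x) - \varphi(\widehat{\theta}_{\mathrm{nu}}) + \varphi(\widehat{\theta}_{\mathrm{de}})\right\}
\]

instead of $\widehat{r}_{\mathrm{A}}(x)$. We compare $\widehat{r}_{\mathrm{A}}(x)$ to Method (B) and Method (C), since the same
normalization factor as in $\widehat{r}_{\mathrm{A}}(x)$ appears in the other methods as shown below. This fact
facilitates the theoretical analysis.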


3.3  Method  (B) 

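For the exponential model (8), the optimization problem (5) is expressed as

\[
\begin{aligned}
(\widehat{\theta}_{\mathrm{B}}, \widehat{\theta}_{\mathrm{B},0})
&= \operatorname*{argmax}_{(\theta, \theta_0) \in \Theta \times \mathbb{R}}
\left[\sum_{i=1}^{n} \log \frac{r(x_i^{\mathrm{nu}}; \theta, \theta_0)}{1 + r(x_i^{\mathrm{nu}}; \theta, \theta_0)}
+ \sum_{j=1}^{n} \log \frac{1}{1 + r(x_j^{\mathrm{de}}; \theta, \theta_0)}\right] \\
&= \operatorname*{argmax}_{(\theta, \theta_0) \in \Theta \times \mathbb{R}}
\left[\sum_{i=1}^{n} \log \frac{\exp\{\theta_0 + \theta^{\top} \xi(x_i^{\mathrm{nu}})\}}{1 + \exp\{\theta_0 + \theta^{\top} \xi(x_i^{\mathrm{nu}})\}}
+ \sum_{j=1}^{n} \log \frac{1}{1 + \exp\{\theta_0 + \theta^{\top} \xi(x_j^{\mathrm{de}})\}}\right].
\end{aligned}
\]

The density ratio estimator $\widehat{r}_{\mathrm{B}}(x)$ for the exponential ratio model is expressed as

\[
\widehat{r}_{\mathrm{B}}(x) = \exp\left\{\widehat{\theta}_{\mathrm{B}}^{\top} \xi(x)\right\}
\left(\frac{1}{n} \sum_{j=1}^{n} \exp\left\{\widehat{\theta}_{\mathrm{B}}^{\top} \xi(x_j^{\mathrm{de}})\right\}\right)^{-1}.
\]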


3.4  Method  (C) 

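For the exponential model (8), the optimization problem (6) is expressed as

\[
\begin{aligned}
(\widehat{\theta}_{\mathrm{C}}, \widehat{\theta}_{\mathrm{C},0})
&= \operatorname*{argmax}_{(\theta, \theta_0) \in \Theta \times \mathbb{R}}
\left[\frac{1}{n} \sum_{i=1}^{n} \log r(x_i^{\mathrm{nu}}; \theta, \theta_0) - \frac{1}{n} \sum_{j=1}^{n} r(x_j^{\mathrm{de}}; \theta, \theta_0)\right] \\
&= \operatorname*{argmax}_{(\theta, \theta_0) \in \Theta \times \mathbb{R}}
\left[\frac{1}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta^{\top} \xi(x_i^{\mathrm{nu}})\right) - \frac{1}{n} \sum_{j=1}^{n} \exp\left\{\theta_0 + \theta^{\top} \xi(x_j^{\mathrm{de}})\right\}\right].
\end{aligned}
\tag{9}
\]

The density ratio estimator $\widehat{r}_{\mathrm{C}}(x)$ for the exponential ratio model is expressed as

\[
\widehat{r}_{\mathrm{C}}(x) = \exp\left\{\widehat{\theta}_{\mathrm{C}}^{\top} \xi(x)\right\}
\left(\frac{1}{n} \sum_{j=1}^{n} \exp\left\{\widehat{\theta}_{\mathrm{C}}^{\top} \xi(x_j^{\mathrm{de}})\right\}\right)^{-1}.
\]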
4 Accuracy Analysis for Correctly Specified Exponential
Models


In this section, we theoretically analyze the accuracy of the above three density ratio
estimators under the assumption that the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ both belong
to the exponential family, i.e., there exist $\theta^*_{\mathrm{nu}} \in \Theta$ and $\theta^*_{\mathrm{de}} \in \Theta$ such that

\[
p^*_{\mathrm{nu}}(x) = p(x; \theta^*_{\mathrm{nu}}),
\qquad
p^*_{\mathrm{de}}(x) = p(x; \theta^*_{\mathrm{de}}).
\]

Since the ratio of two exponential densities also belongs to the exponential model, the
above assumption implies that there exist $\theta^* \in \Theta$ and $\theta^*_0 \in \mathbb{R}$ such that

\[
r^*(x) = r(x; \theta^*, \theta^*_0). \tag{10}
\]

It is straightforward to extend the results in this section to general parametric models,
since we focus on the first-order asymptotics of the estimators. An arbitrary parametric
model $p(x; \theta)$ has the same first-order asymptotics as the exponential model of the form

\[
p_{\exp}(x; \theta) \propto \exp\left\{\log p(x; \theta^*) + (\theta - \theta^*)^{\top} \nabla \log p(x; \theta^*)\right\}
\]

around the parameter $\theta^*$. Thus the same theoretical property holds.

First, we analyze the asymptotic behavior of $J(\widehat{r}_{\mathrm{A}})$. Then we have the following lemma
(proofs of all lemmas, theorems, and corollaries are provided in the Appendix).


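Lemma 1 $J(\hat{r}_A)$ can be asymptotically expressed as

\[
J(\hat{r}_A) = \frac{1}{2n}\left\{ \dim\Theta
+ \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})F(\theta^*_{\mathrm{de}})^{-1}\right)
+ \mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) \right\} + O(n^{-3/2}),
\]

where $O(\cdot)$ denotes the asymptotic order. $F(\theta)$ denotes the Fisher information matrix of
the exponential model $p(x;\theta)$:

\[
F(\theta) := \int \nabla \log p(x;\theta)\, \nabla \log p(x;\theta)^\top p(x;\theta)\,dx,
\]

where $\nabla$ denotes the partial differential operator with respect to $\theta$. $\mathrm{PE}(p\|q)$ denotes the
Pearson divergence of two densities $p$ and $q$ defined as

\[
\mathrm{PE}(p\|q) := \int \frac{(p(x)-q(x))^2}{p(x)}\,dx. \tag{11}
\]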

Next, we investigate the asymptotic behavior of $J(\hat{r}_B)$ and $J(\hat{r}_C)$. Let $y$ be the selector
variable taking 'nu' or 'de' as defined in Section 2.4. The statistical model of the joint
probability for $z = (x, y)$ is defined as

\[
q(z;\theta,\theta_0) = q(y|x;\theta,\theta_0) \times \frac{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}{2}, \tag{12}
\]

where $q(y|x;\theta,\theta_0)$ is the conditional probability of $y$ such that

\[
q(y=\mathrm{nu}\,|\,x;\theta,\theta_0) = \frac{r(x;\theta,\theta_0)}{1+r(x;\theta,\theta_0)}
= \frac{\exp\{\theta_0+\theta^\top\xi(x)\}}{1+\exp\{\theta_0+\theta^\top\xi(x)\}},
\]
\[
q(y=\mathrm{de}\,|\,x;\theta,\theta_0) = \frac{1}{1+r(x;\theta,\theta_0)}
= \frac{1}{1+\exp\{\theta_0+\theta^\top\xi(x)\}}.
\]

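The Fisher information matrix of the model (12) is denoted as

\[
F(\theta,\theta_0) \in \mathbb{R}^{(\dim\Theta+1)\times(\dim\Theta+1)}.
\]

The submatrix of $F(\theta,\theta_0)$ formed by the first $(\dim\Theta)$ rows and the first $(\dim\Theta)$ columns
is defined as

\[
\int \nabla \log q(z;\theta,\theta_0)\, \nabla \log q(z;\theta,\theta_0)^\top q(z;\theta,\theta_0)\,dz.
\]

The inverse matrix of $F(\theta,\theta_0)$ is expressed as

\[
F(\theta,\theta_0)^{-1} =
\begin{pmatrix}
H_{11}(\theta,\theta_0) & h_{12}(\theta,\theta_0) \\
h_{12}(\theta,\theta_0)^\top & h_{22}(\theta,\theta_0)
\end{pmatrix},
\]

where $H_{11}(\theta,\theta_0)$ is a $(\dim\Theta)\times(\dim\Theta)$ matrix. Then we have the following lemmas.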
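
Lemma 2 $J(\hat{r}_B)$ can be asymptotically expressed as

\[
J(\hat{r}_B) = \frac{1}{2n}\left\{ \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})H_{11}(\theta^*,\theta^*_0)\right)
+ \mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) \right\} + O(n^{-3/2}),
\]

where $(\theta^*,\theta^*_0)$ is defined in Eq.(10).

Lemma 3 $J(\hat{r}_C)$ can be asymptotically expressed as

\[
J(\hat{r}_C) = \frac{1}{2n}\left\{ \dim\Theta
+ \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})^{-1}G\right)
+ \mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) \right\} + O(n^{-3/2}),
\]

where

\[
G := \int r^*(x)(\xi(x)-\eta_{\mathrm{nu}})(\xi(x)-\eta_{\mathrm{nu}})^\top p^*_{\mathrm{nu}}(x)\,dx.
\]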
Based on the above lemmas, we compare the accuracy of the three methods. For the
accuracy of (A) and (B), we have the following theorem.

Theorem 4 Asymptotically, the inequality

\[
J(\hat{r}_A) \le J(\hat{r}_B)
\]

holds.

Thus the method (A) is more accurate than the method (B) in terms of the expected
unnormalized Kullback-Leibler divergence (3). Theorem 4 may be regarded as an extension
of the result for binary classification [5]: estimating data-generating Gaussian
densities by maximum likelihood estimation has higher statistical efficiency than logistic
regression in the sense of classification error rate.

Next,  we  compare  the  accuracy  of  (B)  and  (C). 

Theorem  5  Asymptotically,  the  inequality 

\[
J(\hat{r}_B) \le J(\hat{r}_C)
\]

holds. 

Thus  the  method  (B)  is  more  accurate  than  the  method  (C)  in  terms  of  the  expected 
unnormalized  Kullback-Leibler  divergence  (3).  This  inequality  is  a  direct  consequence  of 
the  paper  by  Qin  [16].  In  that  paper,  it  was  shown  that  the  method  (B)  has  the  smallest 
asymptotic  variance  in  a  class  of  semi-parametric  estimators.  It  is  easy  to  see  the  method 
(C)  is  included  in  the  class. 

Finally,  we  compare  the  accuracy  of  (A)  and  (C).  From  Theorem  4  and  Theorem  5, 
we  immediately  have  the  following  corollary. 

Corollary  6  The  inequality 

\[
J(\hat{r}_A) \le J(\hat{r}_C)
\]

holds. 
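A small numerical check can make the comparison between (A) and (C) concrete. The sketch below is an illustration under assumptions that are not part of the report: a one-dimensional Gaussian location model with unit variance, for which xi(x) = x, dim Theta = 1, F(theta) = 1 and eta_nu equals the numerator mean. Only the terms that differ between Lemma 1 and Lemma 3 are compared, namely dim Theta + tr(F(theta_nu) F(theta_de)^{-1}) for (A) and dim Theta + tr(F(theta_nu)^{-1} G) for (C); the Pearson divergence term is common to both.

```python
import math


def g_term(mean_nu, mean_de, lo=-15.0, hi=15.0, steps=30000):
    """Trapezoid approximation of G = int r*(x) (x - eta_nu)^2 p_nu(x) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        p_nu = math.exp(-0.5 * (x - mean_nu) ** 2) / math.sqrt(2.0 * math.pi)
        # r*(x) = p_nu(x) / p_de(x) in closed form for unit-variance Gaussians
        ratio = math.exp((mean_nu - mean_de) * x
                         - 0.5 * (mean_nu ** 2 - mean_de ** 2))
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * ratio * (x - mean_nu) ** 2 * p_nu
    return total * h


# Illustrative means (hypothetical): numerator N(1, 1), denominator N(0, 1).
G = g_term(1.0, 0.0)
coeff_a = 1.0 + 1.0   # dim Theta + tr(F_nu F_de^{-1}) with F = 1
coeff_c = 1.0 + G     # dim Theta + tr(F_nu^{-1} G) with F = 1
```

For this toy model G has the closed form exp(d^2)(1 + d^2) with d the difference of the two means, so the (C) coefficient grows quickly with d while the (A) coefficient stays at 2; the two coincide when the densities are equal.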

It  was  advocated  that  one  should  avoid  solving  more  difficult  intermediate  problems 
when  solving  a  target  problem  [33] .  This  statement  is  sometimes  referred  to  as  “Vapnik’s 
principle”,  and  the  support  vector  machine  [4]  would  be  a  successful  example  of  this 
principle — instead  of  estimating  a  data  generation  model,  it  directly  models  the  decision 
boundary  which  is  sufficient  for  pattern  recognition. 

If we followed Vapnik's principle, directly estimating the ratio $r^*(x)$ would be more
promising than estimating the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$, since knowing $p^*_{\mathrm{nu}}(x)$ and
$p^*_{\mathrm{de}}(x)$ implies knowing $r^*(x)$ but not vice versa; indeed, $r^*(x)$ cannot be uniquely
decomposed into $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$. Thus Corollary 6 is at a glance counter-intuitive. However,
Corollary 6 would be reasonable since the method (C) does not make use of the knowledge
that each density is exponential, but only the knowledge that the ratio is exponential.
Thus the method (A) can utilize the a priori model information more effectively. Thanks
to the additional knowledge that both densities belong to the exponential model, the
intermediate problems (i.e., density estimation) were actually made easier in terms of
Vapnik's principle.
5  Accuracy  Analysis  for  Misspecified  Exponential  Models


In  this  section,  we  theoretically  analyze  the  approximation  error  of  the  three  density 
ratio  estimators  for  misspecified  exponential  models,  i.e.,  the  true  densities  and  ratio  are 
not  necessarily  included  in  the  exponential  models.  The  unnormalized  Kullback-Leibler 
divergence  is  employed  to  measure  the  approximation  error. 

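First, we study the convergence of the method (A). Let $\bar{p}_{\mathrm{nu}}(x)$ and $\bar{p}_{\mathrm{de}}(x)$ be the
projections of the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ onto the model $p(x;\theta)$ in terms of the
Kullback-Leibler divergence (2):

\[
\bar{p}_{\mathrm{nu}}(x) := p(x;\bar{\theta}_{\mathrm{nu}}), \qquad
\bar{p}_{\mathrm{de}}(x) := p(x;\bar{\theta}_{\mathrm{de}}),
\]

where

\[
\bar{\theta}_{\mathrm{nu}} := \operatorname*{argmin}_{\theta\in\Theta} \int p^*_{\mathrm{nu}}(x) \log\frac{p^*_{\mathrm{nu}}(x)}{p(x;\theta)}\,dx, \qquad
\bar{\theta}_{\mathrm{de}} := \operatorname*{argmin}_{\theta\in\Theta} \int p^*_{\mathrm{de}}(x) \log\frac{p^*_{\mathrm{de}}(x)}{p(x;\theta)}\,dx.
\]

This means that $\bar{p}_{\mathrm{nu}}(x)$ and $\bar{p}_{\mathrm{de}}(x)$ are the optimal approximations to $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$
in the model $p(x;\theta)$ in terms of the Kullback-Leibler divergence. Let

\[
\bar{r}_A(x) := \frac{\bar{p}_{\mathrm{nu}}(x)}{\bar{p}_{\mathrm{de}}(x)}.
\]

Since the ratio of two exponential densities also belongs to the exponential model, there
exist $\bar{\theta}_A \in \Theta$ and $\bar{\theta}_{A,0} \in \mathbb{R}$ such that

\[
\bar{r}_A(x) = r(x;\bar{\theta}_A,\bar{\theta}_{A,0}).
\]

Then we have the following lemma.

Lemma 7 $\hat{r}_A$ converges in probability to $\bar{r}_A$ as $n \to \infty$.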

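Next, we investigate the convergence of the method (B). Let $q^*(x,y)$ be the joint
probability defined as

\[
q^*(x,y) = q^*(y|x) \times \frac{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}{2},
\]

where $q^*(y|x)$ is the conditional probability of $y$ such that

\[
q^*(y=\mathrm{nu}\,|\,x) = \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{nu}}(x)+p^*_{\mathrm{de}}(x)}, \qquad
q^*(y=\mathrm{de}\,|\,x) = \frac{p^*_{\mathrm{de}}(x)}{p^*_{\mathrm{nu}}(x)+p^*_{\mathrm{de}}(x)}. \tag{14}
\]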
The model (12) is used to estimate $q^*(x,y)$, and let $\bar{q}(x,y)$ be the projection of the
true density $q^*(x,y)$ onto the model (12) in terms of the Kullback-Leibler divergence (2):

\[
\bar{q}(x,y) := q(x,y;\bar{\theta}_B,\bar{\theta}_{B,0}), \tag{15}
\]

where

\[
(\bar{\theta}_B,\bar{\theta}_{B,0}) := \operatorname*{argmin}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}
\int \sum_{y\in\{\mathrm{nu},\mathrm{de}\}} q^*(x,y) \log\frac{q^*(y|x)}{q(y|x;\theta,\theta_0)}\,dx.
\]

This means that $\bar{q}(x,y)$ is the optimal approximation to $q^*(x,y)$ in the model

\[
q(y|x;\theta,\theta_0)\,\frac{p^*_{\mathrm{nu}}(x)+p^*_{\mathrm{de}}(x)}{2}
\]

in terms of the Kullback-Leibler divergence. Let

\[
\bar{r}_B(x) := r(x;\bar{\theta}_B,\bar{\theta}_{B,0}).
\]

Then we have the following lemma.

Lemma 8 $\hat{r}_B$ converges in probability to $\bar{r}_B$ as $n \to \infty$.

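Finally, we study the convergence of the method (C). Suppose that the model
$r(x;\theta,\theta_0)$ in Eq.(8) is employed. Let $\bar{r}_C(x)$ be the projection of the true ratio function
$r^*(x)$ onto the model $r(x;\theta,\theta_0)$ in terms of the unnormalized Kullback-Leibler divergence
(1):

\[
\bar{r}_C(x) := r(x;\bar{\theta}_C,\bar{\theta}_{C,0}),
\]

where

\[
(\bar{\theta}_C,\bar{\theta}_{C,0}) := \operatorname*{argmin}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}
\left[ \int p^*_{\mathrm{nu}}(x) \log\frac{r^*(x)}{r(x;\theta,\theta_0)}\,dx - 1
+ \int p^*_{\mathrm{de}}(x)\, r(x;\theta,\theta_0)\,dx \right]. \tag{16}
\]

This means that $\bar{r}_C(x)$ is the optimal approximation to $r^*(x)$ in the model $r(x;\theta,\theta_0)$ in terms
of the unnormalized Kullback-Leibler divergence. Then we have the following lemma.

Lemma 9 $\hat{r}_C$ converges in probability to $\bar{r}_C$ as $n \to \infty$.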


Based on the above lemmas, we investigate the relation among the three methods.
Lemma 9 implies that the method (C) is consistent to the optimal approximation $\bar{r}_C$.
However, as we will show below, the methods (A) and (B) are not consistent to the
optimal approximation $\bar{r}_C$ in general. Let us measure the deviation of a density ratio
function $r'$ from $r$ by

\[
D(r',r) := \int p^*_{\mathrm{de}}(x)\,(r'(x)-r(x))^2\,dx.
\]

Then we have the following theorem.
Theorem 10 The inequalities

\[
D(\bar{r}_A,\bar{r}_C) \ge \left( \int p^*_{\mathrm{de}}(x)\,\bar{r}_A(x)\,dx - 1 \right)^2, \qquad
D(\bar{r}_B,\bar{r}_C) \ge \left( \int p^*_{\mathrm{de}}(x)\,\bar{r}_B(x)\,dx - 1 \right)^2
\]

hold. More generally, for any $r$ in the exponential model,

\[
D(r,\bar{r}_C) \ge \left( \int p^*_{\mathrm{de}}(x)\,r(x)\,dx - 1 \right)^2 \tag{17}
\]

holds.

When the model is misspecified, $p^*_{\mathrm{de}}(x)\,\bar{r}_A(x)$ and $p^*_{\mathrm{de}}(x)\,\bar{r}_B(x)$ are not probability
densities in general. Then Theorem 10 implies that the method (A) and the method (B) are
not consistent to the optimal approximation $\bar{r}_C$.

Since  model  misspecification  would  be  a  usual  situation  in  practice,  the  method  (C) 
is  the  most  promising  approach  in  density  ratio  estimation. 

Finally,  for  the  consistency  of  the  method  (A),  we  also  have  the  following  additional 
result. 

Corollary 11 If $p^*_{\mathrm{de}}(x)$ belongs to the exponential model (7), i.e., there exists $\bar{\theta}_{\mathrm{de}} \in \Theta$
such that

\[
p^*_{\mathrm{de}}(x) = p(x;\bar{\theta}_{\mathrm{de}}),
\]

then

\[
\bar{r}_A = \bar{r}_C
\]

holds even when $p^*_{\mathrm{nu}}(x)$ does not belong to the exponential model (7).

This corollary means that, as long as $p^*_{\mathrm{de}}(x)$ is correctly specified, the method (A) is
still consistent.


6  Conclusions 

In  this  paper,  we  theoretically  investigated  the  accuracy  of  three  density  ratio  estimation 
approaches:  (A)  density  ratio  estimation  by  separate  maximum  likelihood  density  estima¬ 
tion,  (B)  density  ratio  estimation  by  logistic  regression,  and  (C)  direct  density  ratio  es¬ 
timation  by  empirical  Kullback-Leibler  divergence  minimization.  Intuitively,  the  method 
(C)  seems  to  be  better  than  the  other  approaches  due  to  “Vapnik’s  principle” — one  should 
not  solve  more  difficult  intermediate  problems  (density  estimation  in  the  current  context) 
when  solving  a  target  problem  (density  ratio  estimation  in  the  current  context). 
However,  as  we  proved  in  Section  4,  the  method  (A)  is  more  accurate  than  the  other
approaches  when  the  numerator  and  denominator  densities  are  known  to  be  members 
of  the  exponential  family.  This  result  is  at  first  sight  counter-intuitive,  but  it  would  be 
reasonable  because  the  methods  (B)  and  (C)  do  not  make  use  of  the  knowledge  that  each 
density  is  exponential,  but  only  the  knowledge  that  their  ratio  is  exponential.  Thus  the 
method  (A)  can  utilize  the  a  priori  model  information  more  effectively  than  the  other 
methods.  We  note  that  this  result  is  not  contradictory  to  Vapnik’s  principle  since  the 
additional  knowledge  that  the  densities  belong  to  the  exponential  model  is  utilized  to 
make  the  intermediate  problems  (density  estimation)  substantially  easier. 

On  the  other  hand,  once  the  correct  model  assumption  is  not  fulfilled,  the  method  (C) 
was  shown  to  be  consistent  to  the  optimal  approximation  in  the  model,  while  the  methods 
(A)  and  (B)  are  not  consistent  in  general  (see  Section  5).  The  fact  that  the  direct  method 
outperforms  the  other  approaches  in  the  absence  of  the  additional  knowledge  would  follow 
Vapnik’s  principle. 

It seems to be a common phenomenon in various situations that a method which
works optimally for correctly specified models performs poorly for misspecified models
and, conversely, a method which works well for misspecified models performs poorly for
correctly specified models. For example, in active learning (or experiment design),
the traditional variance-only approach works optimally for correctly specified models [6].
However, it was shown that the traditional method works poorly once the correct model
assumption is slightly violated [20]. To cope with this problem, various active learning
methods which do not require the correct model assumption have been developed and
shown to work better than the traditional method for misspecified models [34, 11, 20, 9,
24]. However, these methods cannot outperform the traditional method when the model
is correctly specified. Thus the performance loss for correctly specified models would be
the price one has to pay for acquiring robustness against model misspecification.

Model misspecification would almost always occur in practice, so developing methods
for misspecified models is crucial. Based on these observations, we conclude that the
direct density ratio approach (C) is the most promising density ratio estimation method.
A  Asymptotic  Expansion  of  Measure  of  Accuracy

First, we show some fundamental results used for proving Lemma 1, Lemma 2, and Lemma
3.

Using the Taylor expansion

\[
\log(1+t) = t - \frac{t^2}{2} + O(t^3),
\]

we have the following expansion:

\[
\log\frac{p^*_{\mathrm{nu}}(x)}{\hat{r}(x)\,p^*_{\mathrm{de}}(x)}
= \log\frac{r^*(x)}{\hat{r}(x)}
= -\log\frac{\hat{r}(x)}{r^*(x)}
= -\left(\frac{\hat{r}(x)}{r^*(x)} - 1\right)
+ \frac{1}{2}\left(\frac{\hat{r}(x)}{r^*(x)} - 1\right)^2
+ O_p\!\left(\left|\frac{\hat{r}(x)}{r^*(x)} - 1\right|^3\right),
\]

where $O_p(\cdot)$ denotes the stochastic order. Substituting this expansion into the unnormalized
Kullback-Leibler divergence $\mathrm{UKL}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})$, we obtain

\[
\mathrm{UKL}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})
= \frac{1}{2}\,\mathrm{PE}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})
+ O_p(\|\hat{r}/r^* - 1\|^3), \tag{18}
\]

where 'PE' denotes the Pearson divergence defined by Eq.(11) and $\|\hat{r}/r^* - 1\|$ is defined
as

\[
\|\hat{r}/r^* - 1\| := \left( \int p^*_{\mathrm{nu}}(x)\,|\hat{r}(x)/r^*(x) - 1|^2\,dx \right)^{1/2}.
\]
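The second-order relation between the unnormalized Kullback-Leibler divergence and the quadratic (Pearson-type) term can be checked numerically. The sketch below is a discrete analogue under stated assumptions: the vectors P and EPS are hypothetical, sums replace integrals, and the quadratic term is written out explicitly as half of the sum of (p - q)^2 / p, so the check does not depend on the normalization convention chosen for the Pearson divergence.

```python
import math


def ukl(p, q):
    """Discrete unnormalized KL: sum p log(p/q) - sum p + sum q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) - sum(p) + sum(q)


def half_quadratic(p, q):
    """Half of sum (p - q)^2 / p, the second-order term of the expansion."""
    return 0.5 * sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))


# Hypothetical probability vector and small relative perturbations.
P = [0.2, 0.3, 0.5]
EPS = [0.01, -0.02, 0.015]
Q = [pi * (1.0 + e) for pi, e in zip(P, EPS)]  # plays the role of r_hat * p_de

gap = ukl(P, Q) - half_quadratic(P, Q)
```

With relative perturbations of order 1e-2, the two quantities are of order 1e-4 while their gap is of order 1e-7, in line with a third-order remainder.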


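Under a regularity condition of asymptotic statistics, the expectation $E[\|\hat{r}/r^* - 1\|^3]$ is
of order $O(n^{-3/2})$:

\[
E\left[\|\hat{r}/r^* - 1\|^3\right] = O(n^{-3/2}).
\]

See Theorem 5.23 in [32] for the details of the regularity condition on general M-estimators.
Hence, the measure of accuracy $J(\hat{r})$ can be represented as

\[
J(\hat{r}) = \frac{1}{2}\,E\left[\mathrm{PE}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})\right] + O(n^{-3/2}). \tag{19}
\]

Then we have the following lemma.

Lemma 12 (Asymptotics of measure of accuracy) Let $\hat{\theta}$ be an estimator of the
parameter $\theta^*$ in $r^*$, and $\hat{r}(x)$ be the estimator defined as

\[
\hat{r}(x) := \exp\{\hat{\theta}^\top\xi(x)\}
\left( \frac{1}{n}\sum_{j=1}^{n} \exp\{\hat{\theta}^\top\xi(x_j^{\mathrm{de}})\} \right)^{-1}.
\]

Then, the measure of accuracy of $\hat{r}$ is asymptotically given as

\[
J(\hat{r}) = \frac{1}{2}\,\mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})\cdot E[\delta\theta\,\delta\theta^\top]\right)
+ \frac{1}{2n}\,\mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) + O(n^{-3/2}), \tag{20}
\]

where $\delta\theta$ denotes the deviation of $\hat{\theta}$ from the parameter $\theta^*$:

\[
\delta\theta = \hat{\theta} - \theta^*.
\]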
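Proof: The probabilistic order of $\delta\theta$ is $O_p(n^{-1/2})$. Let $\eta_{\mathrm{nu}}$ be

\[
\eta_{\mathrm{nu}} := \int \xi(x)\,p^*_{\mathrm{nu}}(x)\,dx.
\]

Using the Taylor expansion

\[
\exp(t) = 1 + t + O(t^2), \qquad \log(1+t) = t + O(t^2),
\]

we have the following asymptotic expansion of $\hat{r}$:

\begin{align*}
\log \hat{r}(x)
&= \log \frac{r^*(x)\exp\{\delta\theta^\top\xi(x)\}}
{\frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\exp\{\delta\theta^\top\xi(x_j^{\mathrm{de}})\}} \\
&= \log r^*(x) + \delta\theta^\top\xi(x)
- \log\left\{ 1 + \frac{1}{n}\sum_{j=1}^{n}\left(r^*(x_j^{\mathrm{de}}) - 1\right)
+ \delta\theta^\top\cdot\frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}}) + O_p(n^{-1}) \right\} \\
&= \log r^*(x) + \delta\theta^\top(\xi(x) - \eta_{\mathrm{nu}})
- \left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right) + O_p(n^{-1}).
\end{align*}

Therefore, we have

\[
\frac{\hat{r}(x)}{r^*(x)} - 1 = \delta\theta^\top(\xi(x) - \eta_{\mathrm{nu}})
- \left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right) + O_p(n^{-1}).
\]

Substituting the above expression into the Pearson divergence in Eq.(19), we obtain

\begin{align*}
\mathrm{PE}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})
&= \int p^*_{\mathrm{nu}}(x)\left( \frac{\hat{r}(x)}{r^*(x)} - 1 \right)^2 dx \\
&= \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})\,\delta\theta\,\delta\theta^\top\right)
+ \left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right)^2 + O_p(n^{-3/2}).
\end{align*}

Therefore,

\begin{align*}
E\left[\mathrm{PE}(p^*_{\mathrm{nu}}\|\hat{r}\cdot p^*_{\mathrm{de}})\right]
&= \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})\cdot E[\delta\theta\,\delta\theta^\top]\right)
+ \frac{1}{n}\int p^*_{\mathrm{de}}(x)(r^*(x) - 1)^2\,dx + O(n^{-3/2}) \\
&= \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})\cdot E[\delta\theta\,\delta\theta^\top]\right)
+ \frac{1}{n}\,\mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) + O(n^{-3/2}).
\end{align*}

Applying Eq.(18) to the above equation, we obtain Eq.(20). ■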
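B  Proof  of  Lemma  1

According to Lemma 12, we need to compute the asymptotic variance of the estimator $\hat{\theta}_A$
in order to compute the measure of accuracy of $\hat{r}_A$. Based on the standard asymptotic
statistics, the asymptotic distribution of the maximum likelihood estimator for the
exponential family is given as

\[
\sqrt{n}\,(\hat{\theta}_{\mathrm{nu}} - \theta^*_{\mathrm{nu}}) \sim N(0, F(\theta^*_{\mathrm{nu}})^{-1}), \qquad
\sqrt{n}\,(\hat{\theta}_{\mathrm{de}} - \theta^*_{\mathrm{de}}) \sim N(0, F(\theta^*_{\mathrm{de}})^{-1}),
\]

when the sample size $n$ goes to infinity. Under the regularity condition of parametric
estimation, the bias of the estimators is given as

\[
E[\hat{\theta}_{\mathrm{nu}} - \theta^*_{\mathrm{nu}}] = O(n^{-1}), \qquad
E[\hat{\theta}_{\mathrm{de}} - \theta^*_{\mathrm{de}}] = O(n^{-1}).
\]

Then, for

\[
\delta\theta_A := \hat{\theta}_A - (\theta^*_{\mathrm{nu}} - \theta^*_{\mathrm{de}})
= (\hat{\theta}_{\mathrm{nu}} - \theta^*_{\mathrm{nu}}) - (\hat{\theta}_{\mathrm{de}} - \theta^*_{\mathrm{de}}),
\]

we have

\[
E[\delta\theta_A\,\delta\theta_A^\top] = \frac{1}{n}F(\theta^*_{\mathrm{nu}})^{-1} + \frac{1}{n}F(\theta^*_{\mathrm{de}})^{-1} + O(n^{-3/2}),
\]

where we used the fact that $\hat{\theta}_{\mathrm{nu}}$ and $\hat{\theta}_{\mathrm{de}}$ are independent. Substituting the above
asymptotic variance of $\delta\theta_A$ into the first term of Eq.(20), we obtain

\begin{align*}
J(\hat{r}_A)
&= \frac{1}{2n}\left\{ \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})\left(F(\theta^*_{\mathrm{nu}})^{-1} + F(\theta^*_{\mathrm{de}})^{-1}\right)\right)
+ \mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) \right\} + O(n^{-3/2}) \\
&= \frac{1}{2n}\left\{ \dim\Theta + \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})F(\theta^*_{\mathrm{de}})^{-1}\right)
+ \mathrm{PE}(p^*_{\mathrm{de}}\|p^*_{\mathrm{nu}}) \right\} + O(n^{-3/2}),
\end{align*}

which concludes the proof. ■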

C  Proof  of  Lemma  2 

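Let $(\hat{\theta}_B,\hat{\theta}_{B,0})$ be the maximum likelihood estimator with the model (12). Let

\[
\delta\theta_B := \hat{\theta}_B - \theta^* = \hat{\theta}_B - (\theta^*_{\mathrm{nu}} - \theta^*_{\mathrm{de}}).
\]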



Based on the standard asymptotic statistics, the asymptotic distribution of the maximum
likelihood estimator for the exponential family is given as

\[
\sqrt{n}\,\delta\theta_B \sim N(0, H_{11}(\theta^*,\theta^*_0)),
\]

when the sample size $n$ goes to infinity. $H_{11}(\theta,\theta_0)$ is the submatrix of the inverse matrix
of the Fisher information matrix as defined in Eq.(13) and $(\theta^*,\theta^*_0)$ is the parameter
corresponding to the density ratio $r^*(x)$. Hence, the asymptotic variance of $\delta\theta_B$ is given
as

\[
E[\delta\theta_B\,\delta\theta_B^\top] = \frac{1}{n}H_{11}(\theta^*,\theta^*_0) + O(n^{-3/2}).
\]

Substituting the asymptotic variance of $\delta\theta_B$ into the first term of Eq.(20), we establish
the lemma. ■


D  Proof  of  Lemma  3 

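By simple calculation, we find that the optimal solution $(\hat{\theta}_C,\hat{\theta}_{C,0})$ satisfies

\[
\hat{\theta}_{C,0} = -\log\left( \frac{1}{n}\sum_{j=1}^{n} \exp\{\hat{\theta}_C^\top\xi(x_j^{\mathrm{de}})\} \right).
\]

The extremal condition for Eq.(9) with the above expression provides the following
equation:

\[
\frac{1}{n}\sum_{i=1}^{n} \xi(x_i^{\mathrm{nu}})
= \frac{\sum_{j=1}^{n} \exp\{\hat{\theta}_C^\top\xi(x_j^{\mathrm{de}})\}\,\xi(x_j^{\mathrm{de}})}
{\sum_{j=1}^{n} \exp\{\hat{\theta}_C^\top\xi(x_j^{\mathrm{de}})\}}. \tag{21}
\]

Let $\delta\theta_C$ be

\[
\delta\theta_C := \hat{\theta}_C - \theta^* = \hat{\theta}_C - (\theta^*_{\mathrm{nu}} - \theta^*_{\mathrm{de}}).
\]

Then, Eq.(21) is represented as

\[
\frac{1}{n}\sum_{i=1}^{n} \xi(x_i^{\mathrm{nu}})
= \frac{\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\exp\{\delta\theta_C^\top\xi(x_j^{\mathrm{de}})\}\,\xi(x_j^{\mathrm{de}})}
{\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\exp\{\delta\theta_C^\top\xi(x_j^{\mathrm{de}})\}}.
\]

Using the Taylor expansion

\[
\exp(t) = 1 + t + O(t^2), \qquad \frac{1}{1-t} = 1 + t + O(t^2),
\]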
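the asymptotic expansion of the right-hand side of the above equation yields

\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} \xi(x_i^{\mathrm{nu}})
&= \left\{ \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})
+ \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})^\top\delta\theta_C + O_p(n^{-1}) \right\} \\
&\quad\times \left\{ 1 - \left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right)
- \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})^\top\delta\theta_C + O_p(n^{-1}) \right\} \\
&= \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})
- \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})\left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right) \\
&\quad+ \left\{ \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})^\top
- \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}}) \cdot \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})^\top \right\}\delta\theta_C + O_p(n^{-1}) \\
&= \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\xi(x_j^{\mathrm{de}})
- \left(\eta_{\mathrm{nu}} + O_p(n^{-1/2})\right)\left( \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}}) - 1 \right)
+ F(\theta^*_{\mathrm{nu}})\,\delta\theta_C + O_p(n^{-1}) \\
&= \eta_{\mathrm{nu}} + \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\left(\xi(x_j^{\mathrm{de}}) - \eta_{\mathrm{nu}}\right)
+ F(\theta^*_{\mathrm{nu}})\,\delta\theta_C + O_p(n^{-1}).
\end{align*}

Hence, we obtain

\[
F(\theta^*_{\mathrm{nu}})\,\delta\theta_C
= \frac{1}{n}\sum_{i=1}^{n}\left(\xi(x_i^{\mathrm{nu}}) - \eta_{\mathrm{nu}}\right)
- \frac{1}{n}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\left(\xi(x_j^{\mathrm{de}}) - \eta_{\mathrm{nu}}\right) + O_p(n^{-1}). \tag{22}
\]

When the sample size goes to infinity, the central limit theorem provides

\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\xi(x_i^{\mathrm{nu}}) - \eta_{\mathrm{nu}}\right) \sim N(0, F(\theta^*_{\mathrm{nu}})), \qquad
\frac{1}{\sqrt{n}}\sum_{j=1}^{n} r^*(x_j^{\mathrm{de}})\left(\xi(x_j^{\mathrm{de}}) - \eta_{\mathrm{nu}}\right) \sim N(0, G), \tag{23}
\]

where $G$ is the matrix defined in Lemma 3. Combining Eqs.(22) and (23), we obtain the
following expression of the asymptotic variance of $\delta\theta_C$:

\[
E[\delta\theta_C\,\delta\theta_C^\top]
= \frac{1}{n}F(\theta^*_{\mathrm{nu}})^{-1}
+ \frac{1}{n}F(\theta^*_{\mathrm{nu}})^{-1}\,G\,F(\theta^*_{\mathrm{nu}})^{-1} + O(n^{-3/2}).
\]

Substituting the asymptotic variance of $\delta\theta_C$ into the first term of Eq.(20), we establish
the lemma. ■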
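E  Proof  of  Theorem  4

We compare the coefficients of order $O(n^{-1})$ in $J(\hat{r}_A)$ and $J(\hat{r}_B)$, and prove the following
inequality:

\[
\dim\Theta + \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})F(\theta^*_{\mathrm{de}})^{-1}\right)
\le \mathrm{tr}\left(F(\theta^*_{\mathrm{nu}})H_{11}(\theta^*,\theta^*_0)\right). \tag{24}
\]

Let $F_\eta(\tilde{\theta},\tilde{\theta}_0)$ be the Fisher information matrix of the logistic model

\[
q_\eta(y=\mathrm{nu}\,|\,x;\tilde{\theta},\tilde{\theta}_0)
= \frac{\exp\{\tilde{\theta}_0 + \tilde{\theta}^\top(\xi(x) - \eta)\}}{1 + \exp\{\tilde{\theta}_0 + \tilde{\theta}^\top(\xi(x) - \eta)\}}, \qquad
q_\eta(y=\mathrm{de}\,|\,x;\tilde{\theta},\tilde{\theta}_0)
= \frac{1}{1 + \exp\{\tilde{\theta}_0 + \tilde{\theta}^\top(\xi(x) - \eta)\}}, \tag{25}
\]

where $\eta$ is a fixed vector. Let us represent $F_\eta(\tilde{\theta},\tilde{\theta}_0)^{-1}$ in a block form as

\[
F_\eta(\tilde{\theta},\tilde{\theta}_0)^{-1} =
\begin{pmatrix}
H_{\eta,11}(\tilde{\theta},\tilde{\theta}_0) & h_{\eta,12}(\tilde{\theta},\tilde{\theta}_0) \\
h_{\eta,12}(\tilde{\theta},\tilde{\theta}_0)^\top & h_{\eta,22}(\tilde{\theta},\tilde{\theta}_0)
\end{pmatrix}.
\]

When the functions $1, \xi_1(x), \ldots, \xi_k(x)$ are linearly independent, the maximum likelihood
estimator (mle) of $\tilde{\theta}$ for model (25) is given by $\hat{\theta}_B$. The equality

\[
\theta_0 + \theta^\top\xi(x) = \tilde{\theta}_0 + \tilde{\theta}^\top(\xi(x) - \eta)
\]

implies $\tilde{\theta} = \theta$ and $\theta_0 = \tilde{\theta}_0 - \tilde{\theta}^\top\eta = \tilde{\theta}_0 - \theta^\top\eta$. Due to the equality $\tilde{\theta} = \theta$, we see that the
mle of $\tilde{\theta}$ is equal to that of $\theta$, and hence, the variance is unchanged under the parameter
transformation, that is,

\[
H_{\eta,11}(\tilde{\theta},\tilde{\theta}_0) = H_{\eta,11}(\theta,\theta_0 + \theta^\top\eta) = H_{11}(\theta,\theta_0)
\]

holds for any $\eta$. The Fisher information matrix $F_\eta$ can be represented as

\[
F_\eta(\theta^*,\theta^*_0 + \theta^{*\top}\eta)
= \frac{1}{2}\int \frac{p^*_{\mathrm{nu}}(x)\,p^*_{\mathrm{de}}(x)}{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}
\begin{pmatrix} \xi(x) - \eta \\ 1 \end{pmatrix}
\begin{pmatrix} (\xi(x) - \eta)^\top & 1 \end{pmatrix} dx
=
\begin{pmatrix}
\tilde{F}_{\eta,11} & \tilde{f}_{\eta,12} \\
\tilde{f}_{\eta,12}^\top & \tilde{f}_{\eta,22}
\end{pmatrix}.
\]

The first equality is obtained by the straightforward calculation of the Fisher information
matrix. Applying the matrix inversion formula to the block form, we obtain

\[
H_{11}(\theta^*,\theta^*_0) = H_{\eta,11}(\theta^*,\theta^*_0 + \theta^{*\top}\eta)
= \tilde{F}_{\eta,11}^{-1}
+ \frac{\tilde{F}_{\eta,11}^{-1}\tilde{f}_{\eta,12}\tilde{f}_{\eta,12}^\top\tilde{F}_{\eta,11}^{-1}}
{\tilde{f}_{\eta,22} - \tilde{f}_{\eta,12}^\top\tilde{F}_{\eta,11}^{-1}\tilde{f}_{\eta,12}}.
\]

Since $F_\eta$ is positive definite, we have

\[
\tilde{f}_{\eta,22} - \tilde{f}_{\eta,12}^\top\tilde{F}_{\eta,11}^{-1}\tilde{f}_{\eta,12} > 0,
\]

and hence, we obtain the inequality

\[
H_{11}(\theta^*,\theta^*_0) \succeq \tilde{F}_{\eta,11}^{-1}
\]

for any $\eta$. In the above formula, $A \succeq B$ indicates the fact that the matrix $A - B$ is
positive semidefinite. On the other hand, the inequalities

\begin{align*}
\tilde{F}_{\eta_{\mathrm{nu}},11}
&= \frac{1}{2}\int \frac{p^*_{\mathrm{nu}}(x)\,p^*_{\mathrm{de}}(x)}{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}
(\xi(x) - \eta_{\mathrm{nu}})(\xi(x) - \eta_{\mathrm{nu}})^\top dx \\
&\preceq \frac{1}{2}\int p^*_{\mathrm{nu}}(x)(\xi(x) - \eta_{\mathrm{nu}})(\xi(x) - \eta_{\mathrm{nu}})^\top dx
= \frac{1}{2}F(\theta^*_{\mathrm{nu}}), \\
\tilde{F}_{\eta_{\mathrm{de}},11}
&= \frac{1}{2}\int \frac{p^*_{\mathrm{nu}}(x)\,p^*_{\mathrm{de}}(x)}{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}
(\xi(x) - \eta_{\mathrm{de}})(\xi(x) - \eta_{\mathrm{de}})^\top dx \\
&\preceq \frac{1}{2}\int p^*_{\mathrm{de}}(x)(\xi(x) - \eta_{\mathrm{de}})(\xi(x) - \eta_{\mathrm{de}})^\top dx
= \frac{1}{2}F(\theta^*_{\mathrm{de}})
\end{align*}

hold. Therefore, we obtain

\[
H_{11}(\theta^*,\theta^*_0)
\succeq \frac{1}{2}\tilde{F}_{\eta_{\mathrm{nu}},11}^{-1} + \frac{1}{2}\tilde{F}_{\eta_{\mathrm{de}},11}^{-1}
\succeq F(\theta^*_{\mathrm{nu}})^{-1} + F(\theta^*_{\mathrm{de}})^{-1}.
\]

By multiplying $F(\theta^*_{\mathrm{nu}})$ from the left-hand side and taking the trace of both sides, we
obtain the inequality (24). ■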

F  Proof  of  Theorem  10 

We prove the general expression (17) for any $r$ in the exponential model. The optimality
condition of Eq.(16) provides the equality

\[
\int p^*_{\mathrm{de}}(x)\,\bar{r}_C(x)\,dx = \int p^*_{\mathrm{nu}}(x)\,dx = 1.
\]

Hence, we have

\[
\int p^*_{\mathrm{de}}(x)\,(r(x) - \bar{r}_C(x))\,dx = \int p^*_{\mathrm{de}}(x)\,r(x)\,dx - 1.
\]

Applying the Schwarz inequality to the above equality, we obtain

\[
D(r,\bar{r}_C) \ge \left( \int p^*_{\mathrm{de}}(x)\,r(x)\,dx - 1 \right)^2.
\]

Thus, $r$ is different from $\bar{r}_C$ unless $p^*_{\mathrm{de}} \cdot r$ is a probability density. ■
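
The Schwarz step used above can be written out explicitly. The following is a short worked derivation, using the definition of $D$ from Section 5 and the fact that $p^*_{\mathrm{de}}$ integrates to one; the notation $\bar{r}_C$ for the optimal approximation is assumed here.

```latex
\begin{align*}
\left( \int p^*_{\mathrm{de}}(x)\, r(x)\, dx - 1 \right)^2
&= \left( \int p^*_{\mathrm{de}}(x)\, \bigl(r(x) - \bar{r}_C(x)\bigr) \cdot 1 \, dx \right)^2 \\
&\le \left( \int p^*_{\mathrm{de}}(x)\, \bigl(r(x) - \bar{r}_C(x)\bigr)^2 dx \right)
     \left( \int p^*_{\mathrm{de}}(x)\, 1^2 \, dx \right) \\
&= D(r, \bar{r}_C).
\end{align*}
```

Equality would require $r - \bar{r}_C$ to be constant on the support of $p^*_{\mathrm{de}}$, so the bound is strict for typical misspecified models.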

G  Proof  of  Corollary  11 

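The optimality condition of the method (A) provides the equality

\[
\int p^*_{\mathrm{nu}}(x)\,\xi(x)\,dx = \int \bar{p}_{\mathrm{nu}}(x)\,\xi(x)\,dx.
\]

Substituting the equality $\bar{p}_{\mathrm{nu}}(x) = \bar{p}_{\mathrm{de}}(x)\,\bar{r}_A(x)$ into the above expression, we have

\[
\int p^*_{\mathrm{nu}}(x)\,\xi(x)\,dx = \int \bar{p}_{\mathrm{de}}(x)\,\bar{r}_A(x)\,\xi(x)\,dx.
\]

When $p^*_{\mathrm{de}}(x)$ belongs to the exponential model, we have $\bar{p}_{\mathrm{de}} = p^*_{\mathrm{de}}$ and thus, the equality

\[
\int p^*_{\mathrm{nu}}(x)\,\xi(x)\,dx = \int p^*_{\mathrm{de}}(x)\,\bar{r}_A(x)\,\xi(x)\,dx
\]

holds. The above equation is exactly the same as the optimality condition of Eq.(16) for
the method (C). Thus, $\bar{r}_A = \bar{r}_C$ holds. ■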
Abstract

Accurately  evaluating  statistical  independence  among  random  variables  is  a  key 
element  of  Independent  Component  Analysis  (ICA).  In  this  paper,  we  employ  a 
squared-loss  variant  of  mutual  information  as  an  independence  measure  and  give  its 
estimation  method.  Our  basic  idea  is  to  estimate  the  ratio  of  probability  densities 
directly  without  going  through  density  estimation,  by  which  a  hard  task  of  density 
estimation  can  be  avoided.  In  this  density  ratio  approach,  a  natural  cross-validation 
procedure  is  available  for  hyper-parameter  selection.  Thus,  all  tuning  parameters 
such  as  the  kernel  width  or  the  regularization  parameter  can  be  objectively  opti¬ 
mized.  This  is  an  advantage  over  recently  developed  kernel-based  independence 
measures  and  is  a  highly  useful  property  in  unsupervised  learning  problems  such 
as  ICA.  Based  on  this  novel  independence  measure,  we  develop  an  ICA  algorithm 
named  Least-squares  Independent  Component  Analysis  (LICA). 


1  Introduction 

The  purpose  of  Independent  Component  Analysis  (ICA)  (Hyvarinen  et  al.,  2001)  is  to 
obtain  a  transformation  matrix  that  separates  mixed  signals  into  statistically-independent 
source  signals.  A  direct  approach  to  ICA  is  to  find  a  transformation  matrix  such  that 
independence  among  separated  signals  is  maximized  under  some  independence  measure 
such  as  mutual  information  (MI). 


*A MATLAB® implementation of the proposed algorithm, LICA, is available from
'http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/LICA/index.html'.
Various  approaches  to  evaluating  the  independence  among  random  variables  from
samples  have  been  explored  so  far.  A  naive  approach  is  to  estimate  probability  densities 
based  on  parametric  or  non-parametric  density  estimation  methods.  However,  finding 
an  appropriate  parametric  model  is  not  easy  without  strong  prior  knowledge  and  non- 
parametric  estimation  is  not  accurate  in  high-dimensional  problems.  Thus  this  naive 
approach  is  not  reliable  in  practice.  Another  approach  is  to  approximate  the  negentropy 
(or  negative  entropy)  based  on  the  Gram-Charlier  expansion  (Cardoso  &  Souloumiac, 
1993;  Comon,  1994;  Amari  et  al.,  1996)  or  the  Edgeworth  expansion  (Hulle,  2008).  An 
advantage  of  this  negentropy-based  approach  is  that  a  hard  task  of  density  estimation  is 
not  directly  involved.  However,  these  expansion  techniques  are  based  on  the  assumption 
that  the  target  density  is  close  to  normal  and  violation  of  this  assumption  can  cause  large 
approximation  error. 

The  above  approaches  are  based  on  the  probability  densities  of  signals.  Another 
line  of  research  that  does  not  explicitly  involve  probability  densities  employs  non-linear 
correlation — signals  are  statistically  independent  if  and  only  if  all  non-linear  correlations 
among  the  signals  vanish.  Following  this  line,  computationally  efficient  algorithms  have 
been  developed  based  on  a  contrast  function  (Jutten  &  Herault,  1991;  Hyvarinen,  1999), 
which is an approximation of negentropy or mutual information. However, these methods
require pre-specifying non-linearities in the contrast function, and thus could be inaccurate
if the predetermined non-linearities do not match the target distribution. To cope with
this problem, the kernel trick has been applied in ICA, which allows one to evaluate
all non-linear correlations in a computationally efficient manner (Bach & Jordan, 2002).
However, its practical performance depends on the choice of kernels (more specifically, the
Gaussian kernel width) and there seems to be no theoretically justified method to determine
the kernel width (see also Fukumizu et al., 2009). This is a critical problem in unsupervised
learning problems such as ICA.

In  this  paper,  we  develop  a  new  ICA  algorithm  that  resolves  the  problems  mentioned 
above.  We  adopt  a  squared-loss  variant  of  MI  (which  we  call  squared-loss  MI ;  SMI)  as  an 
independence  measure  and  approximate  it  by  estimating  the  ratio  of  probability  densities 
contained  in  SMI  directly  without  going  through  density  estimation.  This  approach — 
which  follows  the  line  of  Sugiyama  et  al.  (2008),  Kanamori  et  al.  (2009),  and  Nguyen 
et  al.  (2010) — allows  us  to  avoid  a  hard  task  of  density  estimation.  Another  practical 
advantage  of  this  density-ratio  approach  is  that  a  natural  cross-validation  (CV)  procedure 
is  available  for  hyper-parameter  selection.  Thus  all  tuning  parameters  such  as  the  kernel 
width  or  the  regularization  parameter  can  be  objectively  and  systematically  optimized 
through  CV. 

From  an  algorithmic  point  of  view,  our  density-ratio  approach  analytically  provides  a 
non-parametric  estimator  of  SMI;  furthermore  its  derivative  can  also  be  computed  ana¬ 
lytically  and  these  properties  are  utilized  in  deriving  a  new  ICA  algorithm.  The  proposed 
method  is  named  Least-squares  Independent  Component  Analysis  (LICA). 

Characteristics  of  existing  and  proposed  ICA  methods  are  summarized  in  Table  1, 
highlighting  the  advantage  of  the  proposed  LICA  approach. 

The structure of this paper is as follows. In Section 2, we formulate our estimator of
SMI. In Section 3, we derive the LICA algorithm based on the SMI estimator. Section 4 is
devoted to numerical experiments where we show that our method properly estimates the
true demixing matrix using toy datasets, and compare the performances of the proposed
and existing methods on artificial and real datasets.

Table 1: Summary of existing and proposed ICA methods.

Method                                              Hyper-parameter selection   Distribution
Fast ICA (FICA) (Hyvarinen, 1999)                   Not Necessary               Not Free
Natural-gradient ICA (NICA) (Amari et al., 1996)    Not Necessary               Not Free
Kernel ICA (KICA) (Bach & Jordan, 2002)             Not Available               Free
Edgeworth-expansion ICA (EICA) (Hulle, 2008)        Not Necessary               Nearly normal
Least-squares ICA (LICA) (proposed)                 Available                   Free


2  SMI  Estimation  for  ICA 

In  this  section,  we  formulate  the  ICA  problem  and  introduce  our  independence  measure, 
SMI.  Then  we  give  an  estimation  method  of  SMI  and  derive  an  ICA  algorithm. 


2.1  Problem  Formulation 

Suppose there is a $d$-dimensional random signal

\[
x = (x^{(1)}, \ldots, x^{(d)})^\top
\]

drawn from a distribution with density $p(x)$, where $\{x^{(m)}\}_{m=1}^{d}$ are statistically
independent of each other, and $^\top$ denotes the transpose of a matrix or a vector. Thus, $p(x)$ can
be factorized as

\[
p(x) = \prod_{m=1}^{d} p_m(x^{(m)}).
\]

We cannot directly observe the source signal $x$, but only a linearly mixed signal $y$:

\[
y = (y^{(1)}, \ldots, y^{(d)})^\top := Ax,
\]

where $A$ is a $d \times d$ invertible matrix called the mixing matrix. The goal of ICA is, given
samples of the mixed signals $\{y_i\}_{i=1}^{n}$, to obtain a demixing matrix $W$ that recovers the
original source signal $x$. We denote the demixed signal by $z$:

\[
z = Wy.
\]
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The  ideal  solution  is  W  =  A _1,  but  we  can  only  recover  the  source  signals  up  to  permuta¬ 
tion  and  scaling  of  components  of  x  due  to  non-identifiability  of  the  ICA  setup  (Hyvarinen 
et  al.,  2001). 

A direct approach to ICA is to determine $W$ so that components of $z$ are as independent as possible. Here, we adopt SMI as the independence measure:

\[ I_s(Z^{(1)}, \ldots, Z^{(d)}) := \frac{1}{2} \int \left( \frac{q(z)}{r(z)} - 1 \right)^2 r(z) \, dz, \tag{1} \]

where $q(z)$ denotes the joint density of $z$ and $r(z)$ denotes the product of marginal densities $\{q_m(z^{(m)})\}_{m=1}^d$:

\[ r(z) = \prod_{m=1}^d q_m(z^{(m)}). \]

Note that SMI is the Pearson divergence (Pearson, 1900; Paninski, 2003; Liese & Vajda, 2006; Cichocki et al., 2009) between $q(z)$ and $r(z)$, while ordinary MI is the Kullback-Leibler divergence (Kullback & Leibler, 1951). Since $I_s$ is non-negative and it vanishes if and only if $q(z) = r(z)$, the degree of independence among $\{z^{(m)}\}_{m=1}^d$ may be measured by SMI. Note that Eq.(1) corresponds to the $f$-divergence (Ali & Silvey, 1966; Csiszar, 1967) between $q(z)$ and $r(z)$ with the squared-loss, while ordinary MI corresponds to the $f$-divergence with the log-loss. Thus SMI could be regarded as a natural generalization of ordinary MI.

Based on the independence detection property of SMI, we try to find the demixing matrix $W$ that minimizes SMI. Let us denote the demixed samples by

\[ \{ z_i \mid z_i = (z_i^{(1)}, \ldots, z_i^{(d)})^\top := W y_i \}_{i=1}^n. \]

Our key constraint when estimating SMI is that we want to avoid density estimation since it is a hard task (Vapnik, 1998). Below, we show how this could be accomplished.


2.2  SMI  Approximation  via  Density  Ratio  Estimation 

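We approximate SMI via density ratio estimation. Let us denote the ratio of the densities $q(z)$ and $r(z)$ by

\[ g^*(z) := \frac{q(z)}{r(z)}. \tag{2} \]

Then SMI can be written as

\begin{align}
I_s(Z^{(1)}, \ldots, Z^{(d)})
&= \frac{1}{2} \int \left( g^*(z) - 1 \right)^2 r(z) \, dz \nonumber \\
&= \frac{1}{2} \int \left( g^*(z)^2 r(z) - 2 g^*(z) r(z) + r(z) \right) dz \nonumber \\
&= \frac{1}{2} \int \left( g^*(z) q(z) - 2 q(z) + r(z) \right) dz \nonumber \\
&= \frac{1}{2} \int g^*(z) q(z) \, dz - \frac{1}{2}. \tag{3}
\end{align}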
Therefore, SMI can be approximated through the estimation of $\int g^*(z) q(z) \, dz$, the expectation of $g^*(z)$ over $q(z)$. This can be achieved by taking the sample average of an estimator of the density ratio $g^*(z)$, say $\hat{g}(z)$:

\[ \hat{I}_s := \frac{1}{2n} \sum_{i=1}^n \hat{g}(z_i) - \frac{1}{2}. \tag{4} \]


We take the least-squares approach to estimating the density ratio $g^*(z)$:

\[ \inf_g \frac{1}{2} \int \left( g(z) - g^*(z) \right)^2 r(z) \, dz = \inf_g \int \left( \frac{1}{2} g(z)^2 r(z) - g(z) q(z) \right) dz + \text{constant}, \]

where $\inf_g$ is taken over all measurable functions. Obviously the optimal solution is the density ratio $g^*$. Thus computing $I_s$ is now reduced to solving the following optimization problem:

\[ \inf_g \int \left( \frac{1}{2} g(z)^2 r(z) - g(z) q(z) \right) dz. \tag{5} \]


However, directly solving the problem (5) is not possible for the following two reasons. The first reason is that finding the minimizer over all measurable functions is not tractable in practice since the search space is too vast. To overcome this problem, we restrict the search space to some linear subspace $\mathcal{G}$:

\[ \mathcal{G} = \{ \alpha^\top \varphi(z) \mid \alpha = (\alpha_1, \ldots, \alpha_b)^\top \in \mathbb{R}^b \}, \tag{6} \]

where $\alpha$ is a parameter to be learned from samples, and $\varphi(z)$ is a basis function vector such that

\[ \varphi(z) = (\varphi_1(z), \ldots, \varphi_b(z))^\top \ge 0_b \quad \text{for all } z. \]

$0_b$ denotes the $b$-dimensional vector with all zeros. Note that $\varphi(z)$ could be dependent on the samples $\{z_i\}_{i=1}^n$, i.e., kernel models are also allowed. We explain how the basis functions $\varphi(z)$ are chosen in Section 2.3.

The second reason why directly solving the problem (5) is not possible is that the expectations over the true probability densities $q(z)$ and $r(z)$ cannot be computed since $q(z)$ and $r(z)$ are unknown. To cope with this problem, we approximate the expectations by their empirical averages. Then the optimization problem is reduced to

\[ \hat{\alpha} := \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha + \frac{\lambda}{2} \alpha^\top R \alpha \right], \tag{7} \]

where we included $\frac{\lambda}{2} \alpha^\top R \alpha$ ($\lambda > 0$) for avoiding overfitting. $\lambda$ is called the regularization parameter, and $R$ is some positive definite matrix. $\hat{H}$ and $\hat{h}$ are defined as

\[ \hat{H} := \frac{1}{n^d} \sum_{i_1, \ldots, i_d = 1}^n \varphi(z_{i_1}^{(1)}, \ldots, z_{i_d}^{(d)}) \, \varphi(z_{i_1}^{(1)}, \ldots, z_{i_d}^{(d)})^\top, \tag{8} \]

\[ \hat{h} := \frac{1}{n} \sum_{i=1}^n \varphi(z_i). \tag{9} \]

Differentiating the objective function in Eq.(7) with respect to $\alpha$ and equating it to zero, we can obtain an analytic-form solution as

\[ \hat{\alpha} = (\hat{H} + \lambda R)^{-1} \hat{h}. \]

Thus, the solution can be computed very efficiently just by solving a system of linear equations.

Once the density ratio (2) has been estimated, SMI can be approximated by plugging the estimated density ratio $\hat{g}(z) = \hat{\alpha}^\top \varphi(z)$ in Eq.(4):

\[ \hat{I}_s = \frac{1}{2} \hat{h}^\top \hat{\alpha} - \frac{1}{2}. \tag{10} \]

Note that we may obtain various expressions of SMI using the following identities:

\[ \int g^*(z)^2 r(z) \, dz = \int g^*(z) q(z) \, dz, \]

\[ \int g^*(z) r(z) \, dz = \int q(z) \, dz = 1. \]

Ordinary MI based on the Kullback-Leibler divergence can also be estimated similarly using the density ratio (Suzuki et al., 2008). However, the use of SMI is more advantageous due to the analytic-form solution, as described in Section 3.
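The analytic-form solution and the plug-in SMI estimate of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `solve_linear` and `smi_from_moments` are hypothetical, the matrices are passed as nested lists, and the tiny input at the bottom is made up so that the result can be checked by hand; it is not the authors' MATLAB implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copying keeps the caller's data untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def smi_from_moments(h_mat, h_vec, r_mat, lam):
    """Return (alpha_hat, smi_hat) given H-hat, h-hat, R and lambda."""
    b = len(h_vec)
    lhs = [[h_mat[i][j] + lam * r_mat[i][j] for j in range(b)] for i in range(b)]
    alpha = solve_linear(lhs, h_vec)  # solves (H + lambda R) alpha = h
    smi = 0.5 * sum(hv * av for hv, av in zip(h_vec, alpha)) - 0.5
    return alpha, smi


# Hand-checkable toy input (made up for illustration, not real data).
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha_hat, smi_hat = smi_from_moments(identity, [2.0, 4.0], identity, 1.0)
print(alpha_hat, smi_hat)
```

With these toy inputs, $\hat{H} + \lambda R = 2I$, so $\hat{\alpha} = \hat{h}/2$ and the estimate is $\frac{1}{2} \cdot 10 - \frac{1}{2}$. In practice a library linear solver would replace the hand-written elimination.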


2.3  Design  of  Basis  Functions  and  Hyper-parameter  Selection 

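As basis functions, we propose to use a Gaussian kernel:

\[ \varphi_\ell(z) = \exp\left( -\frac{\| z - v_\ell \|^2}{2\sigma^2} \right) = \prod_{m=1}^d \exp\left( -\frac{(z^{(m)} - v_\ell^{(m)})^2}{2\sigma^2} \right), \tag{11} \]

where

\[ \{ v_\ell \mid v_\ell = (v_\ell^{(1)}, \ldots, v_\ell^{(d)})^\top \}_{\ell=1}^b \]

are Gaussian centers randomly chosen from $\{z_i\}_{i=1}^n$. More precisely, we set $v_\ell := z_{c(\ell)}$, where $\{c(\ell)\}_{\ell=1}^b$ are randomly chosen from $\{1, \ldots, n\}$ without replacement. An advantage of the Gaussian kernel lies in the factorizability in Eq.(11), which reduces the computational cost of the matrix $\hat{H}$ significantly:

\[ \hat{H}_{\ell,\ell'} = \frac{1}{n^d} \prod_{m=1}^d \sum_{i=1}^n \exp\left( -\frac{(z_i^{(m)} - v_\ell^{(m)})^2 + (z_i^{(m)} - v_{\ell'}^{(m)})^2}{2\sigma^2} \right). \]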
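The saving from the factorization can be checked numerically on a small problem. The sketch below compares a brute-force evaluation of Eq.(8), which visits all $n^d$ index combinations, against the product-of-sums form. The function names and the random test data are assumptions made for this illustration.

```python
import itertools
import math
import random


def gauss(z, v, sigma):
    """Gaussian basis function exp(-||z - v||^2 / (2 sigma^2))."""
    sq = sum((zm - vm) ** 2 for zm, vm in zip(z, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def h_hat_brute(zs, centers, sigma):
    """H-hat from Eq.(8): average over all n^d index combinations."""
    n, d, b = len(zs), len(zs[0]), len(centers)
    h = [[0.0] * b for _ in range(b)]
    for idx in itertools.product(range(n), repeat=d):
        # Coordinate m of the combined point comes from sample idx[m].
        z = [zs[idx[m]][m] for m in range(d)]
        phi = [gauss(z, v, sigma) for v in centers]
        for l1 in range(b):
            for l2 in range(b):
                h[l1][l2] += phi[l1] * phi[l2] / n ** d
    return h


def h_hat_factorized(zs, centers, sigma):
    """Same matrix via the product-of-sums form: O(b^2 n d) work."""
    n, d, b = len(zs), len(zs[0]), len(centers)
    h = [[1.0 / n ** d] * b for _ in range(b)]
    for l1 in range(b):
        for l2 in range(b):
            for m in range(d):
                h[l1][l2] *= sum(
                    math.exp(-((z[m] - centers[l1][m]) ** 2
                               + (z[m] - centers[l2][m]) ** 2)
                             / (2.0 * sigma ** 2))
                    for z in zs)
    return h


rng = random.Random(0)
samples = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(8)]
centers = samples[:3]  # the paper picks centers at random; a prefix is used here
brute = h_hat_brute(samples, centers, 1.0)
fast = h_hat_factorized(samples, centers, 1.0)
max_gap = max(abs(brute[i][j] - fast[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

The two matrices agree up to floating-point rounding, while the factorized version avoids the loop over $n^d$ combinations.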


We use the RKHS (Reproducing Kernel Hilbert Space) norm of $\alpha^\top \varphi(z)$ induced by the Gaussian kernel as the regularization term $\alpha^\top R \alpha$, which is a popular choice in the kernel method community (Scholkopf & Smola, 2002):

\[ R_{\ell,\ell'} = \exp\left( -\frac{\| v_\ell - v_{\ell'} \|^2}{2\sigma^2} \right). \tag{12} \]


In the experiments, we fix the number of basis functions to

\[ b = \min(300, n), \]

and choose the Gaussian width $\sigma$ and the regularization parameter $\lambda$ by CV with grid search as follows. First, the samples $Z = \{z_i\}_{i=1}^n$ are divided into $K$ disjoint subsets $\{Z_k\}_{k=1}^K$ of (approximately) the same size (we use $K = 5$ in the experiments). Then an estimator $\hat{\alpha}_{Z \setminus Z_k}$ is obtained using $Z \setminus Z_k$ (i.e., $Z$ without $Z_k$) and the approximation error for the hold-out samples $Z_k$ is computed:

\[ J_{Z_k}^{(K\text{-CV})} := \frac{1}{2} \hat{\alpha}_{Z \setminus Z_k}^\top \hat{H}_{Z_k} \hat{\alpha}_{Z \setminus Z_k} - \hat{h}_{Z_k}^\top \hat{\alpha}_{Z \setminus Z_k}, \]

where the matrix $\hat{H}_{Z_k}$ and the vector $\hat{h}_{Z_k}$ are defined in the same way as $\hat{H}$ and $\hat{h}$, but computed only using $Z_k$. This procedure is repeated for $k = 1, \ldots, K$ and its average $J^{(K\text{-CV})}$ is computed:

\[ J^{(K\text{-CV})} := \frac{1}{K} \sum_{k=1}^K J_{Z_k}^{(K\text{-CV})}. \]

For parameter selection, we compute $J^{(K\text{-CV})}$ for all hyper-parameter candidates (the Gaussian width $\sigma$ and the regularization parameter $\lambda$ in the current setting) and choose the parameter that minimizes $J^{(K\text{-CV})}$. We can show that $J^{(K\text{-CV})}$ is an almost unbiased estimator of the objective function in Eq.(5), where the 'almost'-ness comes from the fact that the number of samples is reduced in the CV procedure due to data splitting (Geisser, 1975; Kohavi, 1995).
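The cross-validation score can be sketched as follows. The names `holdout_score`, `split_folds`, `cv_score`, and the callables `fit` and `moments` are hypothetical placeholders for this illustration. The strided split is also an assumption, since the text only requires disjoint subsets of roughly equal size.

```python
def holdout_score(alpha, h_mat_k, h_vec_k):
    """J for one fold: 0.5 * a^T H_k a - h_k^T a on the held-out subset."""
    b = len(alpha)
    quad = sum(alpha[i] * h_mat_k[i][j] * alpha[j]
               for i in range(b) for j in range(b))
    lin = sum(h_vec_k[i] * alpha[i] for i in range(b))
    return 0.5 * quad - lin


def split_folds(n, k):
    """Split indices 0..n-1 into k disjoint folds of nearly equal size."""
    return [list(range(f, n, k)) for f in range(k)]


def cv_score(folds, fit, moments):
    """Average hold-out score over all folds.

    fit(train_indices)    -> alpha-hat estimated from Z without Z_k
    moments(held_indices) -> (H-hat, h-hat) computed from Z_k only
    Both callables are supplied by the user (hypothetical interface).
    """
    scores = []
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        alpha = fit(train)
        h_mat_k, h_vec_k = moments(held_out)
        scores.append(holdout_score(alpha, h_mat_k, h_vec_k))
    return sum(scores) / len(scores)


folds = split_folds(10, 5)
print(folds)
```

A grid search would call `cv_score` once per candidate pair of Gaussian width and regularization parameter and keep the pair with the smallest score.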


3  The  LICA  Algorithms 

In this section, we show how the above SMI estimation idea could be employed in the context of ICA. Here, we derive two algorithms, which we call Least-squares Independent Component Analysis (LICA), for obtaining a minimizer of $\hat{I}_s$ with respect to the demixing matrix $W$: one is based on a plain gradient method (which we refer to as PG-LICA) and the other is based on a natural gradient method for whitened samples (which we refer to as NG-LICA). A MATLAB® implementation of LICA is available from

http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/LICA/index.html


3.1  Plain  Gradient  Algorithm:  PG-LICA 

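Based on the plain gradient technique, an update rule of $W$ is given by

\[ W \longleftarrow W - \varepsilon \frac{\partial \hat{I}_s}{\partial W}, \tag{13} \]

where $\varepsilon$ ($> 0$) is the step size. As shown in Appendix, the gradient is given by

\[ \frac{\partial \hat{I}_s}{\partial W_{k,k'}} = \frac{\partial \hat{h}^\top}{\partial W_{k,k'}} \hat{\alpha} - \frac{1}{2} \hat{\alpha}^\top \left( \frac{\partial \hat{H}}{\partial W_{k,k'}} + \lambda \frac{\partial R}{\partial W_{k,k'}} \right) \hat{\alpha}, \tag{14} \]

where, for $u_\ell = y_{c(\ell)}$ and $y_i = (y_i^{(1)}, \ldots, y_i^{(d)})^\top$,

\[ \frac{\partial \hat{h}_\ell}{\partial W_{k,k'}} = \frac{1}{n\sigma^2} \sum_{i=1}^n (z_i^{(k)} - v_\ell^{(k)}) (u_\ell^{(k')} - y_i^{(k')}) \exp\left( -\frac{\| z_i - v_\ell \|^2}{2\sigma^2} \right), \tag{15} \]

\begin{align}
\frac{\partial \hat{H}_{\ell,\ell'}}{\partial W_{k,k'}}
&= \frac{1}{n^{d-1}} \prod_{m \neq k} \sum_{i=1}^n \exp\left( -\frac{(z_i^{(m)} - v_\ell^{(m)})^2 + (z_i^{(m)} - v_{\ell'}^{(m)})^2}{2\sigma^2} \right) \nonumber \\
&\quad \times \frac{1}{n\sigma^2} \sum_{i=1}^n \left( (z_i^{(k)} - v_\ell^{(k)}) (u_\ell^{(k')} - y_i^{(k')}) + (z_i^{(k)} - v_{\ell'}^{(k)}) (u_{\ell'}^{(k')} - y_i^{(k')}) \right) \nonumber \\
&\quad \times \exp\left( -\frac{(z_i^{(k)} - v_\ell^{(k)})^2 + (z_i^{(k)} - v_{\ell'}^{(k)})^2}{2\sigma^2} \right). \tag{16}
\end{align}

For the regularization matrix $R$ defined by Eq.(12), the partial derivative is given by

\[ \frac{\partial R_{\ell,\ell'}}{\partial W_{k,k'}} = -\frac{1}{\sigma^2} (v_\ell^{(k)} - v_{\ell'}^{(k)}) (u_\ell^{(k')} - u_{\ell'}^{(k')}) \exp\left( -\frac{\| v_\ell - v_{\ell'} \|^2}{2\sigma^2} \right). \]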

In ICA, scaling of components of $z$ can be arbitrary. This implies that the above gradient updating rule can lead to a solution with poor scaling, which is not preferable from a numerical point of view. To avoid possible numerical instability, we normalize $W$ at each gradient iteration as

\[ W_{k,k'} \longleftarrow \frac{W_{k,k'}}{\sqrt{\sum_{k''=1}^d W_{k,k''}^2}}. \tag{17} \]

In practice, we may iteratively perform line search along the gradient and optimize the Gaussian width $\sigma$ and the regularization parameter $\lambda$ by CV. A pseudo code of the PG-LICA algorithm is summarized in Figure 1.
1. Initialize demixing matrix $W$ and normalize it by Eq.(17).

2. Optimize Gaussian width $\sigma$ and regularization parameter $\lambda$ by CV.

3. Compute gradient by Eq.(14).

4. Choose step-size $\varepsilon$ such that $\hat{I}_s$ (see Eq.(10)) is minimized (line-search).

5. Update $W$ by Eq.(13).

6. Normalize $W$ by Eq.(17).

7. Repeat 2.-6. until $W$ converges.

Figure 1: The LICA algorithm with plain gradient descent (PG-LICA).


3.2  Natural  Gradient  Algorithm  for  Whitened  Data:  NG-LICA 


The second algorithm is based on a natural gradient technique (Amari, 1998).

Suppose the data samples are whitened, i.e., samples are transformed as

\[ y_i \longleftarrow \hat{C}^{-\frac{1}{2}} y_i, \tag{18} \]

where $\hat{C}$ is the sample covariance matrix:

\[ \hat{C} := \frac{1}{n} \sum_{i=1}^n \left( y_i - \frac{1}{n} \sum_{j=1}^n y_j \right) \left( y_i - \frac{1}{n} \sum_{j=1}^n y_j \right)^\top. \]

Then it can be shown that a demixing matrix which eliminates the second order correlation is an orthogonal matrix (Hyvarinen et al., 2001). Thus, for whitened data, the search space of $W$ can be restricted to the orthogonal group $O(d)$ without loss of generality.
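Whitening can be illustrated for the two-dimensional case, where the inverse square root of the covariance matrix has a closed form. The sketch below is an assumption-laden example: the function names are hypothetical, the covariance is the centered sample covariance, and the synthetic data are made up for the check.

```python
import math
import random


def covariance_2d(ys):
    """Centered sample covariance (divisor n) of 2-D samples."""
    n = len(ys)
    mean = [sum(y[k] for y in ys) / n for k in range(2)]
    return [[sum((y[a] - mean[a]) * (y[b] - mean[b]) for y in ys) / n
             for b in range(2)] for a in range(2)]


def whiten_2d(ys):
    """Return C^{-1/2} y_i for 2-D samples, with C the sample covariance."""
    c = covariance_2d(ys)
    # Closed-form square root of a 2x2 positive definite matrix:
    # sqrt(C) = (C + s I) / t, s = sqrt(det C), t = sqrt(trace C + 2 s).
    s = math.sqrt(c[0][0] * c[1][1] - c[0][1] * c[1][0])
    t = math.sqrt(c[0][0] + c[1][1] + 2.0 * s)
    r = [[(c[0][0] + s) / t, c[0][1] / t],
         [c[1][0] / t, (c[1][1] + s) / t]]
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    inv = [[r[1][1] / det, -r[0][1] / det],
           [-r[1][0] / det, r[0][0] / det]]
    return [[inv[0][0] * y[0] + inv[0][1] * y[1],
             inv[1][0] * y[0] + inv[1][1] * y[1]] for y in ys]


rng = random.Random(1)
data = []
for _ in range(200):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append([2.0 * a + 1.0, 0.8 * a + 0.3 * b])  # correlated toy samples
white_cov = covariance_2d(whiten_2d(data))
print(white_cov)
```

After the transform, the sample covariance equals the identity matrix up to rounding, which is what allows the search for $W$ to be restricted to orthogonal matrices. For general $d$, an eigendecomposition routine from a numerical library would be used instead of the 2x2 formula.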

The tangent space of $O(d)$ at $W$ is equal to the space of all matrices $U$ such that $W^\top U$ is skew symmetric, i.e., $U W^\top = -W U^\top$. The steepest direction on this tangent space, which is called the natural gradient, is given as follows (Amari, 1998):

\[ \nabla \hat{I}_s(W) := \frac{1}{2} \left( \frac{\partial \hat{I}_s}{\partial W} - W \frac{\partial \hat{I}_s}{\partial W}^\top W \right), \tag{19} \]

where the canonical metric $\langle G_1, G_2 \rangle = \frac{1}{2} \mathrm{tr}(G_1^\top G_2)$ is adopted in the tangent space. Then the geodesic from $W$ in the direction of the natural gradient over $O(d)$ can be expressed by

\[ W \exp\left( t W^\top \nabla \hat{I}_s(W) \right), \]

where $t \in \mathbb{R}$ and 'exp' denotes the matrix exponential, i.e., for a square matrix $D$,

\[ \exp(D) = \sum_{k=0}^\infty \frac{1}{k!} D^k. \]

Thus when we perform line search along the geodesic in the natural gradient direction, the minimizer may be searched from the set

\[ \left\{ W \exp\left( -t W^\top \nabla \hat{I}_s(W) \right) \mid t > 0 \right\}, \tag{20} \]

i.e., $t$ is chosen such that $\hat{I}_s$ (see Eq.(10)) is minimized and $W$ is updated as

\[ W \longleftarrow W \exp\left( -t W^\top \nabla \hat{I}_s(W) \right). \tag{21} \]

Geometry and optimization algorithms on a more general structure, the Stiefel manifold, are discussed in more detail in Nishimori and Akaho (2005).

A pseudo code of the NG-LICA algorithm is summarized in Figure 2.

1. Whiten the data samples by Eq.(18).

2. Initialize demixing matrix $W$ and normalize it by Eq.(17).

3. Optimize Gaussian width $\sigma$ and regularization parameter $\lambda$ by CV.

4. Compute the natural gradient $\nabla \hat{I}_s$ by Eq.(19).

5. Choose step-size $t$ such that $\hat{I}_s$ (see Eq.(10)) is minimized over the set (20).

6. Update $W$ by Eq.(21).

7. Repeat 3.-6. until $W$ converges.

Figure 2: The LICA algorithm with natural gradient descent (NG-LICA).
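The geodesic update of Eq.(21) keeps $W$ orthogonal because the exponent is skew symmetric. A minimal sketch of one such step is given below. The helper names are hypothetical, the matrix exponential is approximated by a truncated power series (adequate only for small matrices with moderate norm), and the Euclidean gradient passed in is an arbitrary made-up matrix rather than the SMI gradient of Eq.(14).

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def expm(d_mat, terms=40):
    """Matrix exponential by the truncated series sum_k D^k / k!."""
    n = len(d_mat)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, d_mat)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def natural_gradient(w, grad):
    """Eq.(19): 0.5 * (G - W G^T W) for a Euclidean gradient G."""
    wgw = matmul(matmul(w, transpose(grad)), w)
    n = len(w)
    return [[0.5 * (grad[i][j] - wgw[i][j]) for j in range(n)]
            for i in range(n)]


def geodesic_step(w, grad, t):
    """Eq.(21): W <- W exp(-t W^T nabla)."""
    nat = natural_gradient(w, grad)
    d_mat = [[-t * v for v in row] for row in matmul(transpose(w), nat)]
    return matmul(w, expm(d_mat))


theta = 0.3
w0 = [[math.cos(theta), math.sin(theta)],
      [-math.sin(theta), math.cos(theta)]]   # an orthogonal starting point
g0 = [[0.2, -1.0], [0.7, 0.4]]               # arbitrary illustrative gradient
w1 = geodesic_step(w0, g0, 0.5)
gram = matmul(transpose(w1), w1)
print(gram)
```

The printed Gram matrix is the identity up to rounding, so the step stays on $O(d)$ for any step size $t$; the line search over the set (20) then only has to compare values of the SMI estimate.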


3.3  Remarks 

The proposed LICA algorithms can be regarded as an application of the general unconstrained least-squares density-ratio estimator proposed by Kanamori et al. (2009) to SMI in the context of ICA.

The optimization problem (5) can also be obtained following the line of Nguyen et al. (2010), which addresses a divergence estimation problem utilizing the Legendre-Fenchel duality. SMI defined by Eq.(1) can be expressed as

\[ I_s(Z^{(1)}, \ldots, Z^{(d)}) = \frac{1}{2} \int \left( \frac{q(z)}{r(z)} \right)^2 r(z) \, dz - \frac{1}{2}. \tag{22} \]

If the Legendre-Fenchel duality of the convex function $\frac{1}{2} x^2$,

\[ \frac{1}{2} x^2 = \sup_y \left( y x - \frac{1}{2} y^2 \right), \]

is applied to $\frac{q(z)}{r(z)}$ in Eq.(22) in a pointwise manner, we have

\begin{align*}
I_s(Z^{(1)}, \ldots, Z^{(d)})
&= \sup_g \int \left( \frac{q(z)}{r(z)} g(z) - \frac{1}{2} g(z)^2 \right) r(z) \, dz - \frac{1}{2} \\
&= -\inf_g \int \left( \frac{1}{2} g(z)^2 r(z) - g(z) q(z) \right) dz - \frac{1}{2},
\end{align*}

where $\sup_g$ and $\inf_g$ are taken over all measurable functions. The infimum in the last line is exactly the problem (5).

SMI is closely related to the kernel independence measures developed recently (Gretton et al., 2005a; Gretton et al., 2005b; Fukumizu et al., 2008). In particular, it has been shown that the Normalized Cross-Covariance Operator (NOCCO) proposed in Fukumizu et al. (2008) is also an estimator of SMI for $d = 2$. However, there is no reasonable hyper-parameter selection method for this and all other kernel-based independence measures (see also Bach & Jordan, 2002 and Fukumizu et al., 2009). This is a crucial limitation in unsupervised learning scenarios such as ICA. On the other hand, cross-validation can be applied to our method for hyper-parameter selection, as shown in Section 2.3.

4  Experiments 

In  this  section,  we  investigate  the  experimental  performance  of  the  proposed  method. 

4.1  Illustrative  Examples 

First, we illustrate how the proposed method behaves using the following three 2-dimensional datasets:

(a) SubSub-Gaussians: $p(x) = U(x^{(1)}; -0.5, 0.5) \, U(x^{(2)}; -0.5, 0.5)$,

(b) SuperSuper-Gaussians: $p(x) = L(x^{(1)}; 0, 1) \, L(x^{(2)}; 0, 1)$,

(c) SubSuper-Gaussians: $p(x) = U(x^{(1)}; -0.5, 0.5) \, L(x^{(2)}; 0, 1)$,

where $U(x; a, b)$ ($a, b \in \mathbb{R}$, $a < b$) denotes the uniform density on $[a, b]$ and $L(x; \mu, v)$ ($\mu \in \mathbb{R}$, $v > 0$) denotes the Laplace density with mean $\mu$ and variance $v$. Let the number of samples be $n = 300$ and we observe mixed samples $\{y_i\}_{i=1}^n$ through the following mixing matrix:

\[ A = \begin{pmatrix} \cos(\pi/4) & \sin(\pi/4) \\ -\sin(\pi/4) & \cos(\pi/4) \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}. \]

The observed samples are plotted in Figure 3. We employed the NG-LICA algorithm described in Figure 2. Hyper-parameters $\sigma$ and $\lambda$ in LICA were chosen by 5-fold CV from the 10 values in $[0.1, 1]$ at regular intervals and the 10 values in $[0.001, 1]$ at regular intervals in log scale, respectively. The regularization term was set to the squared RKHS norm induced by the Gaussian kernel, i.e., we employed $R$ defined by Eq.(12).
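The three toy datasets can be generated along the following lines. This is a sketch under assumptions: the function names and the seed are hypothetical, and the Laplace variate is drawn as a scaled difference of two unit exponentials, which is one standard way to sample it and not necessarily the one used in the original experiments.

```python
import math
import random


def sample_laplace(rng, mean, variance):
    """Laplace draw with the given mean and variance (variance = 2 b^2)."""
    scale = math.sqrt(variance / 2.0)
    # The difference of two independent Exp(1) variates is Laplace(0, 1).
    return mean + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def make_dataset(kind, n, seed=0):
    """Return (sources, mixed samples, mixing matrix) for dataset a, b or c."""
    rng = random.Random(seed)
    draw = {"sub": lambda: rng.uniform(-0.5, 0.5),
            "super": lambda: sample_laplace(rng, 0.0, 1.0)}
    first, second = {"a": ("sub", "sub"),
                     "b": ("super", "super"),
                     "c": ("sub", "super")}[kind]
    c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
    a_mat = [[c, s], [-s, c]]  # rotation by pi/4, as in the text
    xs = [[draw[first](), draw[second]()] for _ in range(n)]
    ys = [[a_mat[0][0] * x[0] + a_mat[0][1] * x[1],
           a_mat[1][0] * x[0] + a_mat[1][1] * x[1]] for x in xs]
    return xs, ys, a_mat


sources, mixed, mixing = make_dataset("c", 300)
print(mixed[:3])
```

Because the mixing matrix is a rotation, the mixed samples have the same norms as the sources, and the ideal demixing matrix is simply its transpose.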
Figure 3: Observed samples (asterisks), true independent directions (dotted lines) and estimated independent directions (solid lines). Panels (a), (b), and (c) correspond to the three datasets.

Figure 4: The value of $\hat{I}_s$ over iterations. Panels (a), (b), and (c) correspond to the three datasets.

Figure 5: The elements of the demixing matrix $W$ over iterations. Solid lines correspond to $W_{1,1}$, $W_{1,2}$, $W_{2,1}$, and $W_{2,2}$, respectively. The dotted lines denote the true values. Panels (a), (b), and (c) correspond to the three datasets.
The true independent directions as well as the estimated independent directions are plotted in Figure 3. Figure 4 depicts the value of the estimated SMI (10) over iterations and Figure 5 depicts the elements of the demixing matrix $W$ over iterations. The results show that the estimated SMI decreases rapidly and good solutions are obtained for all the datasets. The reason why the estimated SMI in Figure 4 does not decrease monotonically is that during the natural gradient optimization procedure, the hyper-parameters ($\lambda$ and $\sigma$) are adjusted by CV (see Figure 2), which possibly causes an increase in the objective values.


4.2  Performance  Comparison 

Here we compare our method with some existing methods (KICA, FICA, JADE (Cardoso & Souloumiac, 1993)) on artificial and real datasets. We used the datasets (a), (b), and (c) in Section 4.1, the ‘demosig’ dataset available in the FastICA package¹ for MATLAB®, and ‘10halo’, ‘Sergio7’, ‘Speech4’, and ‘c5signals’ datasets available in the ICALAB signal processing benchmark datasets² (Cichocki & Amari, 2003). The datasets (a), (b), (c), ‘demosig’, ‘Sergio7’, and ‘c5signals’ are artificial datasets. The datasets ‘10halo’ and ‘Speech4’ are real datasets. We employed the Amari index (Amari et al., 1996) as the performance measure (smaller is better):

\[ \text{Amari index} := \frac{1}{2d(d-1)} \sum_{m,m'=1}^d \left( \frac{|o_{m,m'}|}{\max_{m''} |o_{m,m''}|} + \frac{|o_{m,m'}|}{\max_{m''} |o_{m'',m'}|} \right) - \frac{1}{d-1}, \]

where $o_{m,m'} := [\hat{W} A]_{m,m'}$ for an estimated demixing matrix $\hat{W}$. We used the publicly available MATLAB® codes for KICA³, FICA¹ and JADE⁴, where default parameter settings were used. Hyper-parameters $\sigma$ and $\lambda$ in LICA were chosen by 5-fold CV from the 10 values in $[0.1, 1]$ at regular intervals and the 10 values in $[0.001, 1]$ at regular intervals in log scale, respectively. $R$ was set as Eq.(12).
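The Amari index can be computed directly from its definition. The sketch below uses a hypothetical function name and nested lists for the matrices. The three example inputs are chosen so that the values can be verified by hand: a perfect recovery, a recovery up to permutation and scaling, and a completely uninformative product $\hat{W} A$.

```python
def amari_index(w_hat, a_mat):
    """Amari index of the product W-hat A (0 means perfect demixing)."""
    d = len(w_hat)
    # o[i][j] = |[W-hat A]_{i,j}|
    o = [[abs(sum(w_hat[i][k] * a_mat[k][j] for k in range(d)))
          for j in range(d)] for i in range(d)]
    row_max = [max(row) for row in o]
    col_max = [max(o[i][j] for i in range(d)) for j in range(d)]
    total = sum(o[i][j] / row_max[i] + o[i][j] / col_max[j]
                for i in range(d) for j in range(d))
    return total / (2.0 * d * (d - 1)) - 1.0 / (d - 1)


eye = [[1.0, 0.0], [0.0, 1.0]]
perfect = amari_index(eye, eye)
permuted = amari_index([[0.0, 2.0], [-3.0, 0.0]], eye)
uninformative = amari_index([[1.0, 1.0], [1.0, 1.0]], eye)
print(perfect, permuted, uninformative)
```

The index is zero whenever $\hat{W} A$ is a scaled permutation matrix, which matches the identifiability of ICA up to permutation and scaling, and it reaches one when every entry of $\hat{W} A$ has the same magnitude.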

We randomly generated the mixing matrix $A$ and source signals for artificial datasets, and computed the Amari index between the true $A$ and $\hat{W}$ for $\hat{W}$ estimated by each method. As training samples, we used the first $n$ samples for Sergio7 and c5signals, and the $n$ samples from the 1001st to the $(1000+n)$-th for 10halo and Speech4, where we tested $n = 200$ and $500$.

The performance of each method is summarized in Table 2, which depicts the mean and standard deviation of the Amari index over 50 trials. NG-LICA overall shows good performance. KICA tends to work reasonably well for datasets (a), (b), (c) and ‘demosig’, but it performs poorly for the ICALAB datasets; this seems to be caused by an inappropriate choice of the Gaussian kernel width and local optima. On the other hand, FICA and JADE tend to work reasonably well for the ICALAB datasets, but perform poorly for (a), (b), (c) and ‘demosig’; we conjecture that the contrast functions in FICA and the fourth-order statistics in JADE did not appropriately catch the non-Gaussianity of the datasets (a), (b), (c) and ‘demosig’. Overall, the proposed LICA algorithm is shown to be a promising ICA method.

Table 2: Mean and standard deviation (in parentheses) of the Amari index (smaller is better) for the benchmark datasets. The datasets (a), (b), and (c) are taken from Section 4.1. The ‘demosig’ dataset is taken from the FastICA package. The ‘10halo’, ‘Sergio7’, ‘Speech4’, and ‘c5signals’ datasets are taken from the ICALAB benchmark datasets. The best method in terms of the mean Amari index and comparable ones based on the one-sided t-test at the significance level 1% are indicated by boldface.

dataset     n     NG-LICA      KICA         FICA         JADE
(a)         200   0.05(0.03)   0.04(0.02)   0.06(0.03)   0.04(0.02)
(a)         500   0.03(0.01)   0.03(0.01)   0.03(0.02)   0.02(0.01)
(b)         200   0.06(0.04)   0.12(0.15)   0.16(0.20)   0.15(0.17)
(b)         500   0.04(0.03)   0.05(0.04)   0.11(0.12)   0.05(0.04)
(c)         200   0.08(0.05)   0.09(0.06)   0.14(0.11)   0.13(0.09)
(c)         500   0.04(0.03)   0.04(0.03)   0.09(0.08)   0.10(0.06)
demosig     200   0.04(0.01)   0.05(0.11)   0.08(0.05)   0.08(0.08)
demosig     500   0.02(0.01)   0.04(0.09)   0.04(0.03)   0.04(0.02)
10halo      200   0.29(0.02)   0.38(0.03)   0.33(0.07)   0.36(0.00)
10halo      500   0.22(0.02)   0.37(0.03)   0.22(0.03)   0.28(0.00)
Sergio7     200   0.04(0.01)   0.38(0.04)   0.05(0.02)   0.07(0.00)
Sergio7     500   0.05(0.02)   0.37(0.03)   0.04(0.01)   0.04(0.00)
Speech4     200   0.18(0.03)   0.29(0.05)   0.20(0.03)   0.22(0.00)
Speech4     500   0.07(0.00)   0.10(0.04)   0.10(0.04)   0.06(0.00)
c5signals   200   0.12(0.01)   0.25(0.15)   0.10(0.02)   0.12(0.00)
c5signals   500   0.06(0.04)   0.07(0.06)   0.04(0.02)   0.07(0.00)

¹ http://www.cis.hut.fi/projects/ica/fastica
² http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/benchmarks/
³ http://www.di.ens.fr/~fbach/kernel-ica/index.htm
⁴ http://perso.telecom-paristech.fr/~cardoso/guidesepsou.html


5  Conclusions 

In this paper, we proposed a new ICA method based on a squared-loss variant of mutual information. The proposed method, named least-squares ICA (LICA), has several preferable properties, e.g., it is distribution-free and hyper-parameter selection by cross-validation is available.

Similarly to other ICA algorithms, the optimization problem involved in LICA is non-convex. Thus it is practically very important to develop good heuristics for initialization and avoiding local optima in the gradient procedures, which is an open research topic to be investigated. Moreover, although our SMI estimator is analytic, the LICA algorithm is still computationally rather expensive due to linear equations and cross-validation. Our future work will address the computational issue, e.g., by vectorization and parallelization.


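Appendix: Derivation of the Gradient of the SMI Estimator

Here we show the derivation of the gradient (14) of the SMI estimator (10). Since $\hat{I}_s = \frac{1}{2} \hat{h}^\top \hat{\alpha} - \frac{1}{2}$ (see Eq.(10)), the derivative of $\hat{I}_s$ with respect to $W_{k,k'}$ is given as follows:

\[ \frac{\partial \hat{I}_s}{\partial W_{k,k'}} = \frac{1}{2} \hat{h}^\top \frac{\partial \hat{\alpha}}{\partial W_{k,k'}} + \frac{1}{2} \hat{\alpha}^\top \frac{\partial \hat{h}}{\partial W_{k,k'}}. \tag{23} \]

Recall that $\frac{\partial B(x)^{-1}}{\partial x} = -B(x)^{-1} \frac{\partial B(x)}{\partial x} B(x)^{-1}$ for an arbitrary matrix function $B(x)$. Then the partial derivative of $\hat{\alpha} = (\hat{H} + \lambda R)^{-1} \hat{h}$ with respect to $W_{k,k'}$ is given by

\begin{align*}
\frac{\partial \hat{\alpha}}{\partial W_{k,k'}}
&= -(\hat{H} + \lambda R)^{-1} \frac{\partial (\hat{H} + \lambda R)}{\partial W_{k,k'}} (\hat{H} + \lambda R)^{-1} \hat{h} + (\hat{H} + \lambda R)^{-1} \frac{\partial \hat{h}}{\partial W_{k,k'}} \\
&= -(\hat{H} + \lambda R)^{-1} \frac{\partial (\hat{H} + \lambda R)}{\partial W_{k,k'}} \hat{\alpha} + (\hat{H} + \lambda R)^{-1} \frac{\partial \hat{h}}{\partial W_{k,k'}}.
\end{align*}

Substituting this in Eq.(23), we have

\begin{align*}
\frac{\partial \hat{I}_s}{\partial W_{k,k'}}
&= -\frac{1}{2} \hat{\alpha}^\top \frac{\partial (\hat{H} + \lambda R)}{\partial W_{k,k'}} \hat{\alpha} + \frac{1}{2} \hat{\alpha}^\top \frac{\partial \hat{h}}{\partial W_{k,k'}} + \frac{1}{2} \hat{\alpha}^\top \frac{\partial \hat{h}}{\partial W_{k,k'}} \\
&= -\frac{1}{2} \hat{\alpha}^\top \frac{\partial \hat{H}}{\partial W_{k,k'}} \hat{\alpha} - \frac{\lambda}{2} \hat{\alpha}^\top \frac{\partial R}{\partial W_{k,k'}} \hat{\alpha} + \hat{\alpha}^\top \frac{\partial \hat{h}}{\partial W_{k,k'}},
\end{align*}

which gives Eq.(14).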

Acknowledgments 

The  authors  would  like  to  thank  Dr.  Takafumi  Kanamori  for  his  valuable  comments. 
T.S.  was  supported  in  part  by  the  JSPS  Research  Fellowships  for  Young  Scientists  and 
Global  COE  Program  “The  research  and  training  center  for  new  development  in  mathe¬ 
matics”,  MEXT,  Japan.  M.S.  acknowledges  support  from  SCAT,  AOARD,  and  the  JST 
PRESTO  program. 


References 

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131-142.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems (pp. 757-763). MIT Press.

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. Radar and Signal Processing, IEE Proceedings-F, 140, 362-370.

Cichocki, A., & Amari, S. (2003). Adaptive blind signal and image processing: Learning algorithms and applications. Wiley.

Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. (2009). Non-negative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. New York: Wiley.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314.

Csiszar, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229-318.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. The Annals of Statistics, 37, 1871-1905.

Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20 (pp. 489-496). Cambridge, MA: MIT Press.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-328.

Gretton, A., Bousquet, O., Smola, A., & Scholkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. Algorithmic Learning Theory (pp. 63-77). Berlin: Springer-Verlag.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Scholkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6, 2075-2129.

Hulle, M. M. V. (2008). Sequential fixed-point ICA based on mutual information minimization. Neural Computation, 20, 1344-1365.

Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10, 626-634.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1-10.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391-1445.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. The Fourteenth International Joint Conference on Artificial Intelligence (pp. 1137-1143). Morgan Kaufmann.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79-86.

Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52, 4394-4412.

Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, to appear.

Nishimori, Y., & Akaho, S. (2005). Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67, 106-135.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15, 1191-1253.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157-172.

Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699-746.

Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. New Challenges for Feature Selection in Data Mining and Knowledge Discovery (pp. 5-20).

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.


IEICE  Transactions  on  Information  and  Systems, 
vol.E93-D,  no.  10,  pp.2690-2701,  2010. 

Revised  on  January  6,  2011. 




Superfast-Trainable  Multi-Class  Probabilistic  Classifier 
by  Least-Squares  Posterior  Fitting 

Masashi  Sugiyama  (sugi@cs.titech.ac.jp) 

Tokyo  Institute  of  Technology 
and 

Japan  Science  and  Technology  Agency 


Abstract 

Kernel logistic regression (KLR) is a powerful and flexible classification algorithm, which possesses an ability to provide the confidence of class prediction. However, its training, typically carried out by (quasi-)Newton methods, is rather time-consuming. In this paper, we propose an alternative probabilistic classification algorithm called Least-Squares Probabilistic Classifier (LSPC). KLR models the class-posterior probability by the log-linear combination of kernel functions and its parameters are learned by (regularized) maximum likelihood. In contrast, LSPC employs the linear combination of kernel functions and its parameters are learned by regularized least-squares fitting of the true class-posterior probability. Thanks to this linear regularized least-squares formulation, the solution of LSPC can be computed analytically just by solving a regularized system of linear equations in a class-wise manner. Thus LSPC is computationally very efficient and numerically stable. Through experiments, we show that the computation time of LSPC is shorter than that of KLR by orders of magnitude, with comparable classification accuracy.

Keywords 

Probabilistic classification, kernel logistic regression, class-posterior probability,
squared loss.


1  Introduction 

The support vector machine (SVM) [7, 33] is a popular method for classification. Various
computationally efficient algorithms for training SVMs with massive datasets have been
proposed so far (see [24, 16, 5, 6, 29, 26, 32, 13, 11, 30, 17, 31, 12] and many other software
packages available online). However, SVM cannot provide the confidence of class prediction
since it only learns the decision boundaries between different classes. To cope with this
problem, several post-processing methods have been developed for approximately computing
the class-posterior probability [25, 34].

On the other hand, logistic regression (LR) is a classification algorithm that can
naturally give the confidence of class prediction since it directly learns the class-posterior
probabilities [15]. Recently, various efficient algorithms for training LR models
specialized in sparse data have been developed [22, 10].

Applying the kernel trick to LR as done in SVM, one can easily obtain a non-linear
classifier with probabilistic outputs, called kernel logistic regression (KLR). Since the
kernel matrix is often dense (e.g., for Gaussian kernels), the state-of-the-art LR algorithms
for sparse data are not applicable to KLR. Thus, in order to train KLR classifiers, standard
non-linear optimization techniques such as Newton’s method (more specifically, iteratively
reweighted least-squares) and quasi-Newton methods (for example, the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method) seem to be commonly used in practice
[15, 23]. Although the performance of these general-purpose non-linear optimization
techniques has improved along with computing environments over the last decade,
computing the KLR solution is still challenging when the number of training samples is
large. The purpose of this paper is to propose an alternative probabilistic classification
method that can be trained very efficiently.

Our proposed method is called the Least-Squares Probabilistic Classifier (LSPC). In
LSPC, we use a linear combination of Gaussian kernels centered at training points as a
model of class-posterior probabilities. Then we fit this model to the true class-posterior
probability by least-squares (see footnote 1). An advantage of this linear least-squares
formulation is that consistency is guaranteed without taking into account the normalization
factor. In contrast, normalization is essential in the maximum-likelihood LR formulation;
otherwise the likelihood tends to infinity. Thanks to the simplification brought by excluding
the normalization factor from the optimization criterion, we can compute the globally optimal
solution of LSPC analytically just by solving a system of linear equations.

Furthermore, we show that the use of a linear combination of kernel functions in
LSPC allows us to learn the parameters in a class-wise manner. This contributes
substantially to further reducing the computational cost, particularly in multi-class
classification scenarios. Through experiments, we show that LSPC is computationally
much more efficient than KLR with comparable accuracy.


2  Least-squares  Approach  to  Probabilistic  Classification

In  this  section,  we  formulate  the  problem  of  probabilistic  classification  and  give  a  new 
method  in  the  least-squares  framework. 


Footnote 1: A least-squares formulation has been employed for improving the computational
efficiency of SVMs [29, 26, 13]. However, these approaches deal with deterministic
classification, not probabilistic classification.

2.1  Problem  Formulation


Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the input domain, where $d$ is the dimensionality of the input domain.
Let $\mathcal{Y} = \{1, \ldots, c\}$ be the set of labels, where $c$ is the number of classes. Let us consider
a joint probability distribution on $\mathcal{X} \times \mathcal{Y}$ with joint probability density $p(x, y)$. Suppose
that we are given $n$ independent and identically distributed (i.i.d.) paired samples of
input $x$ and output $y$:

$$\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}_{i=1}^{n}.$$

The goal is to estimate the class-posterior probability $p(y|x)$ from the samples
$\{(x_i, y_i)\}_{i=1}^{n}$. The class-posterior probability allows us to classify a test sample $x$ to class $\hat{y}$
with confidence $p(\hat{y}|x)$:

$$\hat{y} := \mathop{\mathrm{argmax}}_{y} \; p(y|x).$$

Let us denote the marginal density of $x$ by $p(x)$ and assume that it is strictly
positive:

$$p(x) > 0 \quad \text{for all } x \in \mathcal{X}.$$

Then, by definition, the class-posterior probability $p(y|x)$ can be expressed as

$$p(y|x) = \frac{p(x, y)}{p(x)}. \tag{1}$$

This expression will be utilized in the derivation of the proposed method below.


2.2  Linear  Least-squares  Fitting  of  Class-posterior  Probability 

Here we introduce our least-squares fitting idea. We begin with the formulation for
learning the class-posterior probability $p(y|x)$ as a function of both $x$ and $y$, i.e., the
class-posterior probabilities for all classes are learned simultaneously. Then in Section 2.3,
we show that this simultaneous learning problem can be decomposed into independent
class-wise learning problems, which substantially reduces the computational cost.

We model the class-posterior probability $p(y|x)$ by the following linear model:

$$q(y|x; \alpha) := \sum_{\ell=1}^{b} \alpha_\ell \phi_\ell(x, y) = \alpha^\top \phi(x, y),$$

where $^\top$ denotes the transpose of a matrix or a vector,

$$\alpha = (\alpha_1, \ldots, \alpha_b)^\top$$

are parameters to be learned from samples, and

$$\phi(x, y) = (\phi_1(x, y), \ldots, \phi_b(x, y))^\top$$

are basis functions such that

$$\phi(x, y) \ge 0_b \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y}.$$

$0_b$ denotes the $b$-dimensional vector with all zeros, and the inequality for vectors is applied
in the element-wise manner. We explain how the basis functions $\phi(x, y)$ are practically
chosen in Section 2.3.

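We determine the parameter $\alpha$ in the model $q(y|x; \alpha)$ so that the following squared
error $J$ is minimized:

$$
\begin{aligned}
J(\alpha) &:= \frac{1}{2} \sum_{y=1}^{c} \int \bigl(q(y|x; \alpha) - p(y|x)\bigr)^2 p(x)\,dx \\
&= \frac{1}{2} \sum_{y=1}^{c} \int q(y|x; \alpha)^2 p(x)\,dx - \sum_{y=1}^{c} \int q(y|x; \alpha)\, p(x, y)\,dx + \text{Const.} \\
&= \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha + \text{Const.},
\end{aligned}
$$

where we used Eq.(1). The $b \times b$ matrix $H$ and the $b$-dimensional vector $h$ are defined as

$$H := \sum_{y=1}^{c} \int \phi(x, y)\, \phi(x, y)^\top p(x)\,dx,$$

$$h := \sum_{y=1}^{c} \int \phi(x, y)\, p(x, y)\,dx.$$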
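The step that uses Eq.(1) can be spelled out as follows. This short expansion is added here for clarity and relies only on the definitions above; it is a worked restatement, not an additional result.

```latex
\begin{aligned}
\bigl(q(y|x;\alpha) - p(y|x)\bigr)^2 p(x)
  &= q(y|x;\alpha)^2 p(x) - 2\, q(y|x;\alpha)\, p(y|x)\, p(x) + p(y|x)^2 p(x) \\
  &= q(y|x;\alpha)^2 p(x) - 2\, q(y|x;\alpha)\, p(x,y) + p(y|x)^2 p(x)
  && \text{since } p(y|x)\,p(x) = p(x,y) \text{ by Eq.(1)}.
\end{aligned}
% The last term does not depend on \alpha, so it is absorbed into Const.
% Substituting q(y|x;\alpha) = \alpha^\top \phi(x,y) gives
%   q(y|x;\alpha)^2 = \alpha^\top \phi(x,y)\,\phi(x,y)^\top \alpha ,
% and multiplying by 1/2, summing over y, and integrating over x yields
%   \tfrac{1}{2}\,\alpha^\top H \alpha - h^\top \alpha + \text{Const.}
```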


$H$ and $h$ contain the expectations over unknown densities $p(x)$ and $p(x, y)$, so we
approximate the expectations by sample averages. Then we have

$$\hat{H} := \frac{1}{n} \sum_{y=1}^{c} \sum_{i=1}^{n} \phi(x_i, y)\, \phi(x_i, y)^\top,$$

$$\hat{h} := \frac{1}{n} \sum_{i=1}^{n} \phi(x_i, y_i).$$

Now our optimization criterion is formulated as

$$\hat{\alpha} := \mathop{\mathrm{argmin}}_{\alpha} \left[ \frac{1}{2} \alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha + \frac{\lambda}{2} \alpha^\top \alpha \right],$$

where a regularizer $\lambda \alpha^\top \alpha / 2$ ($\lambda > 0$) is included for regularization purposes. Taking the
derivative of the above objective function and equating it to zero, we see that the solution
$\hat{\alpha}$ can be obtained just by solving the following system of linear equations:

$$(\hat{H} + \lambda I_b)\, \alpha = \hat{h}, \tag{2}$$

where $I_b$ denotes the $b$-dimensional identity matrix. Thus, the solution $\hat{\alpha}$ is given
analytically as

$$\hat{\alpha} = (\hat{H} + \lambda I_b)^{-1} \hat{h}.$$

In order to assure that the solution $q(y|x; \hat{\alpha})$ is a conditional probability, we round up
negative outputs to zero [35] and renormalize the solution. Consequently, our final solution
is expressed as

$$\hat{p}(y|x) = \frac{\max(0, \hat{\alpha}^\top \phi(x, y))}{\sum_{y'=1}^{c} \max(0, \hat{\alpha}^\top \phi(x, y'))}. \tag{3}$$
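Eqs.(2) and (3) can be sketched in a few lines of NumPy for a generic basis. This is a minimal illustration, not the authors' implementation: the function names, the 0-based class labels, the `features` helper, the toy data, and the uniform fallback when every class is clipped to zero are all assumptions made here.

```python
import numpy as np

def lspc_fit(Phi, labels, lam):
    """Solve Eq.(2) for a generic basis.

    Phi    : array (n, c, b); Phi[i, y] holds phi(x_i, y) (classes are 0-based here).
    labels : int array (n,) with values in {0, ..., c-1}.
    lam    : regularization parameter lambda > 0.
    """
    n, c, b = Phi.shape
    # H_hat = (1/n) sum_y sum_i phi(x_i, y) phi(x_i, y)^T
    H = np.einsum("iyk,iyl->kl", Phi, Phi) / n
    # h_hat = (1/n) sum_i phi(x_i, y_i)
    h = Phi[np.arange(n), labels].sum(axis=0) / n
    return np.linalg.solve(H + lam * np.eye(b), h)

def lspc_posterior(Phi_test, alpha):
    """Eq.(3): clip negative outputs to zero, then renormalize over classes."""
    q = np.maximum(0.0, Phi_test @ alpha)            # shape (m, c)
    total = q.sum(axis=1, keepdims=True)
    # Assumed fallback (not in the source): uniform output if all classes clip to zero.
    uniform = np.full_like(q, 1.0 / q.shape[1])
    return np.where(total > 0, q / np.where(total > 0, total, 1.0), uniform)

def features(x_query, centers, c):
    """Hypothetical basis: Gaussian kernel (sigma = 1) on inputs, delta kernel on labels."""
    K = np.exp(-(x_query[:, None] - centers[None, :]) ** 2 / 2.0)
    Phi = np.zeros((len(x_query), c, c * len(centers)))
    for y in range(c):
        Phi[:, y, y * len(centers):(y + 1) * len(centers)] = K
    return Phi

# Toy one-dimensional data: two well-separated classes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])
labels = np.repeat([0, 1], 20)

alpha = lspc_fit(features(x, x, 2), labels, lam=0.1)
post = lspc_posterior(features(np.array([-2.0, 2.0]), x, 2), alpha)
```

Each row of `post` is a probability vector over the two classes for one query point.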


We  call  the  above  method  Least-Squares  Probabilistic  Classifier  (LSPC).  LSPC  can  be 
regarded  as  an  application  of  a  density  ratio  estimation  method  called  the  unconstrained 
Least-Squares  Importance  Fitting  (uLSIF)  [18]  to  probabilistic  classification.  Thus  all  the 
theoretical  properties  of  uLSIF  such  as  consistency,  the  rate  of  convergence,  and  numerical 
stability  [19,  20]  may  be  directly  translated  into  the  current  context. 


2.3  Basis  Function  Design 

A naive choice of basis functions $\phi(x, y)$ would be a kernel model, i.e., for some kernel
function $K'$,

$$q(y|x; \alpha) = \sum_{y'=1}^{c} \sum_{\ell=1}^{n} \alpha_\ell^{(y')} K'(x, x_\ell, y, y'), \tag{4}$$

which contains $cn$ parameters. For this model, the computational complexity for solving
Eq.(2) is $O(c^3 n^3)$.

Here we propose to separate input $x$ and output $y$, and use the delta kernel for $y$ (as
in KLR):

$$q(y|x; \alpha) = \sum_{y'=1}^{c} \sum_{\ell=1}^{n} \alpha_\ell^{(y')} K(x, x_\ell)\, \delta_{y,y'},$$

where $K$ is a kernel function for $x$ and $\delta_{y,y'}$ is the Kronecker delta:

$$\delta_{y,y'} = \begin{cases} 1 & \text{if } y = y', \\ 0 & \text{otherwise.} \end{cases}$$

This model choice actually allows us to speed up the computation of LSPC significantly
since all the calculations can be carried out separately in a class-wise manner. Indeed,
the above model for class $y$ is expressed as

$$q(y|x; \alpha) = \sum_{\ell=1}^{n} \alpha_\ell^{(y)} K(x, x_\ell). \tag{5}$$

Then the matrix $\hat{H}$ becomes block-diagonal, as illustrated in Figure 1(a). Thus we only
need to train a model with $n$ parameters separately $c$ times for each class $y$, by solving
the following equation:
$$(\hat{H}' + \lambda I_n)\, \alpha^{(y)} = \hat{h}'^{(y)},$$

where $\hat{H}'$ is the $n \times n$ matrix and $\hat{h}'^{(y)}$ is the $n$-dimensional vector defined as

$$\hat{H}'_{\ell,\ell'} := \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_\ell)\, K(x_i, x_{\ell'}),$$

$$\hat{h}'^{(y)}_{\ell} := \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_\ell)\, \delta_{y,y_i}.$$

[Figure 1: two panels, (a) Model (5) and (b) Model (7); rows and columns are grouped into kernels for classes 1 to 3 and samples in classes 1 to 3.]

Figure 1: Structure of matrix $\hat{H}$ for model (5) and model (7). The number of classes is
$c = 3$. Suppose training samples $\{(x_i, y_i)\}_{i=1}^{n}$ are sorted according to label $y$. Colored
blocks are non-zero and others are zeros. For model (5) consisting of $c$ sets of $n$ basis
functions, the matrix $\hat{H}$ becomes block-diagonal (with common block matrix $\hat{H}'$), and
thus training can be carried out separately for each block. For model (7) consisting of $c$
sets of $n_y$ basis functions, the size of the target block is further reduced.

Since $\hat{H}'$ is common to all $y$, we only need to compute $(\hat{H}' + \lambda I_n)^{-1}$ once. Then the
computational complexity for obtaining the solution is $O(n^3 + cn^2)$, which is smaller than
the case with general kernel model (4). Thus this approach would be computationally
efficient when the number of classes $c$ is large.
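The class-wise training for model (5) can be sketched as a single linear solve with one right-hand side per class. The sketch below assumes a Gaussian kernel, 0-based labels, and hypothetical function names; it is an illustration of the shared-matrix idea rather than the published code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all rows a of A and b of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lspc_full_fit(X, labels, n_classes, sigma, lam):
    """Model (5): one shared n x n matrix, one right-hand side per class."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)                 # K[i, l] = K(x_i, x_l)
    H = K.T @ K / n                                  # shared block H'
    onehot = np.eye(n_classes)[labels]               # onehot[i, y] = delta_{y, y_i}
    h = K.T @ onehot / n                             # column y is h'^(y)
    # A single solve call handles all c right-hand sides at once.
    return np.linalg.solve(H + lam * np.eye(n), h)   # column y is alpha^(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
labels = rng.integers(0, 3, size=12)
alpha = lspc_full_fit(X, labels, n_classes=3, sigma=1.0, lam=0.1)
```

Solving the full block-diagonal system of size cn gives the same coefficients; passing all right-hand sides to one solve simply avoids repeating the factorization of the shared matrix.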

Here, we further propose to reduce the number of kernels in model (5). To this end,
we focus on a kernel function $K(x, x')$ that is “localized”. Examples of such localized
kernels include the popular Gaussian kernel [28]:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \tag{6}$$

Our idea is to reduce the number of kernels by locating the kernels only at samples
belonging to the target class:

$$q(y|x; \alpha) = \sum_{\ell=1}^{n_y} \alpha_\ell^{(y)} K(x, x_\ell^{(y)}), \tag{7}$$

where $n_y$ is the number of training samples in class $y$, and $\{x_\ell^{(y)}\}_{\ell=1}^{n_y}$ are the training input
samples in class $y$.

Figure 2: Heuristic of reducing the number of basis functions: locate Gaussian kernels
only at the samples of the target class.

The  rationale  behind  this  model  simplification  is  as  follows  (Figure  2).  By  definition, 
the  class-posterior  probability  p(y\x)  takes  large  values  in  the  regions  where  samples  in 
class  y  are  dense;  conversely,  p(y\x)  takes  smaller  values  (i.e.,  close  to  zero)  in  the  regions 
where  samples  in  class  y  are  sparse.  When  a  non-negative  function  is  approximated  by  a 
Gaussian  kernel  model,  many  kernels  may  be  needed  in  the  region  where  the  output  of 
the  target  function  is  large;  on  the  other  hand,  only  a  small  number  of  kernels  would  be 
enough  in  the  region  where  the  output  of  the  target  function  is  close  to  zero.  Following 
this  heuristic,  many  kernels  are  allocated  in  the  region  where  p(y\x)  takes  large  values, 
which  can  be  achieved  by  Eq.(7). 

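This model simplification allows us to further reduce the computational cost since the
size of the target blocks in matrix $\hat{H}$ is further reduced, as illustrated in Figure 1(b). In
order to learn the $n_y$-dimensional parameter vector

$$\alpha^{(y)} = (\alpha_1^{(y)}, \ldots, \alpha_{n_y}^{(y)})^\top$$

for each class $y$, we only need to solve the following system of $n_y$ linear equations:

$$(\hat{H}^{(y)} + \lambda I_{n_y})\, \alpha^{(y)} = \hat{h}^{(y)}, \tag{8}$$

where $\hat{H}^{(y)}$ is the $n_y \times n_y$ matrix and $\hat{h}^{(y)}$ is the $n_y$-dimensional vector defined as

$$\hat{H}^{(y)}_{\ell,\ell'} := \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_\ell^{(y)})\, K(x_i, x_{\ell'}^{(y)}), \tag{9}$$

$$\hat{h}^{(y)}_{\ell} := \frac{1}{n} \sum_{i=1}^{n_y} K(x_i^{(y)}, x_\ell^{(y)}).$$

Let $\hat{\alpha}^{(y)}$ be the solution of Eq.(8). Then our final solution is given by

$$\hat{p}(y|x) = \frac{\max\bigl(0, \sum_{\ell=1}^{n_y} \hat{\alpha}_\ell^{(y)} K(x, x_\ell^{(y)})\bigr)}{\sum_{y'=1}^{c} \max\bigl(0, \sum_{\ell=1}^{n_{y'}} \hat{\alpha}_\ell^{(y')} K(x, x_\ell^{(y')})\bigr)}. \tag{10}$$

Input: Labeled training samples $\{(x_i, y_i)\}_{i=1}^{n}$
    (equivalently, $\{x_i^{(y)}\}_{i=1}^{n_y}$ for class $y = 1, \ldots, c$),
    Gaussian width $\sigma$, and regularization parameter $\lambda$;
Output: Class-posterior probability $\hat{p}(y|x)$;

for $y = 1, \ldots, c$
    $\hat{H}^{(y)}_{\ell,\ell'} \longleftarrow \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\|x_i - x_\ell^{(y)}\|^2 + \|x_i - x_{\ell'}^{(y)}\|^2}{2\sigma^2} \right)$ for $\ell, \ell' = 1, \ldots, n_y$;
    $\hat{h}^{(y)}_{\ell} \longleftarrow \frac{1}{n} \sum_{i=1}^{n_y} \exp\left( -\frac{\|x_i^{(y)} - x_\ell^{(y)}\|^2}{2\sigma^2} \right)$ for $\ell = 1, \ldots, n_y$;
    Solve linear equation $(\hat{H}^{(y)} + \lambda I_{n_y})\, \alpha^{(y)} = \hat{h}^{(y)}$ and obtain $\hat{\alpha}^{(y)}$;
end

$\hat{p}(y|x) \longleftarrow \dfrac{\max\left(0, \sum_{\ell=1}^{n_y} \hat{\alpha}_\ell^{(y)} \exp\left(-\frac{\|x - x_\ell^{(y)}\|^2}{2\sigma^2}\right)\right)}{\sum_{y'=1}^{c} \max\left(0, \sum_{\ell=1}^{n_{y'}} \hat{\alpha}_\ell^{(y')} \exp\left(-\frac{\|x - x_\ell^{(y')}\|^2}{2\sigma^2}\right)\right)}$;

Figure 3: Pseudo code of LSPC for simplified model (7) with Gaussian kernel (6).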


For the simplified model (7), the computational complexity for obtaining the solution
is $O(c n_y^2 n)$; when $n_y = n/c$ for all $y$, this is equal to $O(c^{-1} n^3)$. Thus this approach is
computationally highly efficient for multi-class problems.
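To make the three complexity figures quoted in this section concrete, the following sketch evaluates their leading-order operation counts for balanced classes. The function name and the example values of n and c are chosen here for illustration; constants hidden by the O-notation are ignored.

```python
def solve_cost(n, c):
    """Leading-order operation counts quoted in the text (constants dropped)."""
    n_y = n // c                              # balanced classes, n_y = n / c
    return {
        "model (4)": (c * n) ** 3,            # O(c^3 n^3): one cn x cn system
        "model (5)": n ** 3 + c * n ** 2,     # O(n^3 + c n^2): shared n x n block
        "model (7)": c * n_y ** 2 * n,        # O(c n_y^2 n) = O(n^3 / c)
    }

costs = solve_cost(n=1000, c=10)
for name, ops in costs.items():
    print(f"{name}: {ops:.2e}")
```

With n = 1000 and c = 10, the counts fall from 10^12 for model (4) to about 10^9 for model (5) and 10^8 for model (7).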

A pseudo code of the simplest LSPC implementation for Gaussian kernels is summarized
in Figure 3. Its MATLAB® implementation is available from

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSPC/
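For readers who prefer Python, the procedure of Figure 3 can be sketched with NumPy as below. This is not the MATLAB implementation referred to above: the function names, the 0-based labels, the toy data, and the guard against an all-zero denominator are assumptions of this sketch, and every class is assumed to contain at least one training sample.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def lspc_train(X, labels, n_classes, sigma, lam):
    """Return per-class (centers, alpha_hat), following the loop in Figure 3."""
    n = X.shape[0]
    model = []
    for y in range(n_classes):                      # classes are 0-based here
        Xy = X[labels == y]                         # kernel centers: samples of class y
        K_all = np.exp(-sq_dists(X, Xy) / (2 * sigma ** 2))    # shape (n, n_y)
        K_own = np.exp(-sq_dists(Xy, Xy) / (2 * sigma ** 2))   # shape (n_y, n_y)
        H = K_all.T @ K_all / n                     # Eq.(9): sum over all n samples
        h = K_own.sum(axis=0) / n                   # sum over the n_y samples of class y
        alpha = np.linalg.solve(H + lam * np.eye(len(Xy)), h)  # Eq.(8)
        model.append((Xy, alpha))
    return model

def lspc_predict_proba(model, X_test, sigma):
    """Eq.(10): clip negative scores to zero and renormalize over classes."""
    scores = np.column_stack([
        np.exp(-sq_dists(X_test, Xy) / (2 * sigma ** 2)) @ alpha
        for Xy, alpha in model
    ])
    scores = np.maximum(0.0, scores)
    total = scores.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0   # assumption of this sketch: avoid 0/0 if all classes clip
    return scores / total

# Toy usage with two well-separated classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
labels = np.repeat([0, 1], 30)
model = lspc_train(X, labels, n_classes=2, sigma=1.0, lam=0.1)
proba = lspc_predict_proba(model, np.array([[-2.0, -2.0], [2.0, 2.0]]), sigma=1.0)
```

In practice the Gaussian width and the regularization parameter would be chosen by cross-validation, as described in Section 3, rather than fixed as in this toy call.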


3  Experiments 

In  this  section,  we  experimentally  compare  the  performance  of  the  following  classification 
methods: 

•  LSPC:  LSPC  with  model  (7). 

•  LSPC  (full):  LSPC  with  model  (5). 

•  KLR: $\ell_2$-penalized kernel logistic regression with Gaussian kernels. We used a
MATLAB® implementation included in the ‘minFunc’ package [27], which uses
limited-memory BFGS updates with Shanno-Phua scaling in computing the step
direction and a bracketing line-search for a point satisfying the strong Wolfe
conditions to compute the step size.

When we fed data to learning algorithms, the input samples were normalized in the
element-wise manner so that each element has mean zero and unit variance. The Gaussian
width $\sigma$ and the regularization parameter $\lambda$ for all the methods were chosen based on 2-fold
cross-validation from

$$\sigma \in \{\tfrac{1}{10} m, \tfrac{1}{5} m, \tfrac{1}{2} m, \tfrac{2}{3} m, m, \tfrac{3}{2} m, 2m, 5m, 10m\},$$

$$\lambda \in \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\},$$

where

$$m := \mathrm{median}(\{\|x_i - x_j\|\}_{i,j=1}^{n}).$$
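The preprocessing and the candidate grids can be sketched as follows. The function names and the three-point toy input are assumptions of this sketch; the width multipliers are passed in as an argument, and the demo uses only the multipliers 1, 2, 5, and 10 for brevity.

```python
import numpy as np

def standardize(X):
    """Element-wise normalization to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def median_distance(X):
    """m := median of ||x_i - x_j|| over all pairs i, j = 1..n (including i = j)."""
    diff = X[:, None, :] - X[None, :, :]
    return float(np.median(np.sqrt((diff ** 2).sum(axis=2))))

def candidate_grids(X, multipliers):
    """Gaussian widths as multiples of m, and the five regularization candidates."""
    m = median_distance(X)
    sigmas = [f * m for f in multipliers]
    lambdas = [10.0 ** e for e in (-2, -1.5, -1, -0.5, 0)]
    return sigmas, lambdas

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
m = median_distance(X)                      # pairwise distances are 0, 5, and 10
sigmas, lambdas = candidate_grids(X, multipliers=(1, 2, 5, 10))
Z = standardize(X)
```

In an experiment, `standardize` would be applied first and the grids computed on the standardized inputs; the demo keeps the raw points only so that the distances are easy to check by hand.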


3.1  Illustrative  Examples 

First,  we  illustrate  the  behavior  of  each  method  using  a  toy  dataset. 

We set the dimension of the input space to $d = 2$ and the number of classes to $c = 3$.
We  independently  drew  samples  in  each  class  from  the  following  class-conditional  sample 
densities  (see  Figure  4): 


p(x\y  =  1)  =  N  ^ 
p(x\y  =  2)  =  N  ^ 
p(x\y  =  3  )  =  ^N

where $N(x; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance matrix $\Sigma$.
We set the class-prior probabilities $p(y)$ as

$$p(y) = \begin{cases} 1/4 & \text{if } y = 1, 2, \\ 1/2 & \text{if } y = 3, \end{cases}$$

and  we  set  the  number  of  training  samples  to  n  =  200.  Generated  samples  are  plotted  in 
Figure  5. 

The true class-posterior probabilities $p(y|x)$ ($\propto p(x|y)\,p(y)$) and their estimates obtained
by LSPC, LSPC(full), and KLR are depicted in Figure 6. The plots show that all the
methods approximate the true class-posterior probabilities well in the training region
(say, $[-5, 5]^2$). However, the output outside the training region is substantially different
in LSPC and KLR. This is induced by the difference of the models: a linear combination
of Gaussian kernels is used in LSPC, while the exponential of such a combination is used
in KLR. Outside the training region, there is no kernel, and thus a linear combination of
Gaussian kernels takes values close to zero (note that the values are not exactly zero since
Gaussian tails extended from training regions remain everywhere); then typically one of
the classes takes a value close to one, and the others tend to zero outside the training
regions. On the other hand, KLR outputs values close to one outside the training region
since $\exp(0) = 1$; then they are normalized and thus are reduced to $1/c$.

[Figure 4: plots of the class-conditional densities, including a panel titled P(x|y=3); axis ticks run from -10 to 10.]

Figure 4: Illustrative examples. Class-conditional sample densities $p(x|y)$.

Figure 5: Illustrative examples. Training samples are plotted with filled symbols. Unfilled
symbols denote the classification results based on the true class-posterior probabilities and
their estimates obtained by LSPC, LSPC(full), and KLR.

[Figure 6: grid of plots, including a panel titled "p-hat(y=1|x) by LSPC(full)".]

Figure 6: Illustrative examples. The plots show the true class-posterior probabilities
$p(y|x)$ and their estimates by LSPC, LSPC(full), and KLR from top to bottom, and $y = 1, 2, 3$
from left to right.
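The contrast between the two models far from the training data can be illustrated with a toy calculation. The setup below (one unit-weight Gaussian kernel per class, two classes, a single query point) is hypothetical and far simpler than a trained classifier; it only shows how the two normalizations behave when all kernel values are tiny.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def gaussian(x, center, sigma=1.0):
    return math.exp(-(x - center) ** 2 / (2 * sigma ** 2))

centers = [-1.0, 1.0]        # hypothetical: one kernel per class, coefficient 1
x_far = 8.0                  # a query point well outside the data

f = [gaussian(x_far, c) for c in centers]          # both tiny, but not equal
lspc_like = normalize([max(0.0, v) for v in f])    # linear model, clipped and renormalized
klr_like = normalize([math.exp(v) for v in f])     # log-linear model: exp(f), then normalize
```

Here `lspc_like` puts almost all of its mass on the class whose kernel is nearer to the query, while `klr_like` is essentially uniform at 1/c.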

The classification results based on the true class-posterior probabilities and their
estimates obtained by LSPC, LSPC(full), and KLR are plotted in Figure 5. This shows that
all the methods gave reasonable classification results.

3.2  Performance  Comparison 

Next,  we  evaluate  the  classification  accuracy  and  computation  time  of  each  method  using 
the  following  multi-class  classification  datasets  taken  from  the  LIBSVM  web  page  [5]: 

•  mnist:  Input  dimensionality  is  717  and  the  number  of  classes  is  10. 

•  usps:  Input  dimensionality  is  256  and  the  number  of  classes  is  10. 

•  satimage:  Input  dimensionality  is  36  and  the  number  of  classes  is  6. 

•  letter:  Input  dimensionality  is  16  and  the  number  of  classes  is  26. 

We investigated the classification accuracy and computation time of LSPC,
LSPC(full), and KLR. For given $n$ and $c$, we randomly chose $n_y = \lfloor n/c \rfloor$ training samples
from each class $y$, where $\lfloor t \rfloor$ is the largest integer not greater than $t$. In the first set of
experiments, we fixed the number of classes $c$ to the original number shown above, and
changed the number of training samples as $n = 100, 200, 500, 1000, 2000$. In the second
set of experiments, we fixed the number of training samples to $n = 1000$, and changed
the number of classes $c$; only samples in the first $c$ classes in the dataset were used. The
classification accuracy was evaluated using 100 test samples randomly chosen from each
class. The computation time was measured by the CPU computation time required for
training each classifier when the Gaussian width and the regularization parameter chosen
by cross-validation were used.

The  experimental  results  are  summarized  in  Figure  7  and  Figure  8.  The  left  column  in 
Figure  7  shows  that  when  n  is  increased,  the  classification  error  for  all  the  methods  tends 
to  decrease,  and  LSPC,  LSPC  (full),  and  KLR  performed  similarly  well.  The  right  column 
in  Figure  7  shows  that  when  n  is  increased,  the  computation  time  tends  to  grow  for  all 
the  methods.  LSPC  is  faster  than  KLR  by  two  orders  of  magnitude.  The  left  column 
in  Figure  8  shows  that  when  c  is  increased,  the  classification  error  tends  to  increase  for 
all  the  methods,  and  LSPC,  LSPC  (full),  and  KLR  behaved  similarly  well.  The  right 
column in Figure 8 shows that when c is increased, the computation time of KLR tends
to grow, while that of LSPC stays constant or even tends to decrease slightly. This
happens because the number of samples in each class decreases when c is increased, and
the computation time of LSPC is governed by the number of samples in each class, not
by the total number of samples (see Section 2.3).

Overall, the computation of LSPC was shown to be faster than that of KLR by orders
of magnitude, while LSPC and KLR were shown to be comparable to each other in
terms of the classification accuracy. LSPC and LSPC (full) were shown to possess similar
classification performance, and thus the computationally more efficient version, LSPC,
would be preferable in practice.

Figure 7: Misclassification rate (in percent, left) and computation time (in seconds, right)
as functions of the number of training samples n. From top to bottom, the graphs
correspond to the ‘mnist’, ‘usps’, ‘satimage’, and ‘letter’ datasets.

[Figure 8 panel titles: Misclassification rate (mnist, n=1000); (usps, n=1000); (satimage, n=1000); (letter, n=1000).]

Figure 8: Misclassification rate (in percent, left) and computation time (in seconds, right)
as functions of the number of classes c. From top to bottom, the graphs correspond to
the ‘mnist’, ‘usps’, ‘satimage’, and ‘letter’ datasets.

4  Discussion  and  Conclusion 

Recently, various efficient algorithms for computing the solution of logistic regression have
been developed for high-dimensional sparse data [22, 10]. However, for dense data, using
standard non-linear optimization techniques such as Newton’s method or quasi-Newton
methods seems to be a common choice [15, 23]. The performance of these general-purpose
non-linear optimizers has been improved in the last decade, but computing the solution of
logistic regression for a large number of dense training samples is still a challenging problem.

In this paper, we proposed a simple probabilistic classification algorithm called the
Least-Squares Probabilistic Classifier (LSPC). LSPC employs a linear combination of Gaussian
kernels  centered  at  training  points  for  modeling  the  class-posterior  probability  and  the 
parameters  are  learned  by  least-squares.  Notable  advantages  of  LSPC  are  that  its  solution 
can  be  computed  analytically  just  by  solving  a  system  of  linear  equations  and  training 
can  be  carried  out  separately  in  a  class-wise  manner.  In  experiments,  we  showed  that 
LSPC  is  faster  than  kernel  logistic  regression  (KLR)  in  computation  time  by  two  orders 
of  magnitude,  with  comparable  accuracy. 

The  computational  efficiency  of  LSPC  was  brought  by  the  combination  of  appropriate 
model  choice  and  loss  function.  More  specifically,  KLR  uses  a  log-linear  combination  of 
kernel  functions  and  its  parameters  are  learned  by  regularized  maximum  likelihood.  In 
this  log-linear  maximum  likelihood  formulation,  normalization  of  the  model  is  essential 
to  avoid  the  likelihood  diverging  to  infinity.  Thus  the  likelihood  function  tends  to  be 
complicated  and  numerically  solving  the  optimization  problem  may  be  unavoidable.  On 
the  other  hand,  in  LSPC,  we  chose  a  linear  combination  of  Gaussian  kernel  functions 
for  modeling  the  class-posterior  probability  and  its  parameters  are  learned  by  regularized 
least-squares. This combination allowed us to obtain the solution analytically. When
Newton’s method (more specifically, iteratively reweighted least-squares) is used for
learning the KLR model, a system of linear equations needs to be solved in every iteration
until convergence [15]. On the other hand, LSPC requires solving a system of linear
equations only once.

We chose to separate the kernel for inputs and outputs, and adopted the delta kernel
for outputs (see Eq.(5)). This allowed us to perform the training of LSPC in a class-wise
manner. We showed that this contributes to reducing the training time particularly in
multi-class classification problems. We note that this model choice is essentially the same
as that of KLR (see footnote 2).

We further proposed to reduce the number of kernels when “localized” kernels such
as the Gaussian kernel (6) are used. Through the experimental evaluation in Section 3, we
found that this heuristic model simplification does not degrade the classification accuracy,
but reduces the computation time.

Footnote 2: The number of parameters in LSPC with model (5) is $cn$, while the number of
parameters in KLR is $(c - 1)n$ since the normalization (‘sum-to-one’) constraint is
incorporated in the training phase.

It is straightforward to show that solutions for all regularization parameter values (i.e.,
the regularization path, see [9, 14]) can be computed efficiently in LSPC. Let us consider
the eigendecomposition of the matrix $\hat{H}^{(y)}$ (see Eq.(9)):

$$\hat{H}^{(y)} = \sum_{\ell=1}^{n_y} \gamma_\ell\, \psi_\ell \psi_\ell^\top,$$

where $\{\psi_\ell\}_{\ell=1}^{n_y}$ are the eigenvectors of $\hat{H}^{(y)}$ associated with the eigenvalues $\{\gamma_\ell\}_{\ell=1}^{n_y}$. Then,
the solution $\hat{\alpha}^{(y)}$ can be expressed as

$$\hat{\alpha}^{(y)} = (\hat{H}^{(y)} + \lambda I_{n_y})^{-1} \hat{h}^{(y)} = \sum_{\ell=1}^{n_y} \frac{\hat{h}^{(y)\top} \psi_\ell}{\gamma_\ell + \lambda}\, \psi_\ell.$$

Since $(\hat{h}^{(y)\top} \psi_\ell)\, \psi_\ell$ is common to all $\lambda$, we can compute the solution $\hat{\alpha}^{(y)}$ for all $\lambda$ efficiently
by eigendecomposing the matrix $\hat{H}^{(y)}$ once in advance. Although eigendecomposition of
$\hat{H}^{(y)}$ may be computationally slightly more demanding than solving a system of linear
equations of the same size, this approach would be useful, e.g., when computing the
solutions for various values of $\lambda$ in the cross-validation procedure.
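The eigendecomposition trick can be sketched as follows. The function name and the random stand-in matrix are assumptions of this sketch; `numpy.linalg.eigh` is used because the matrix is symmetric, so the eigenvectors it returns are orthonormal.

```python
import numpy as np

def regularization_path(H, h, lambdas):
    """Solutions of (H + lam I) alpha = h for every lam, from one eigendecomposition."""
    gamma, Psi = np.linalg.eigh(H)       # H = Psi diag(gamma) Psi^T for symmetric H
    proj = Psi.T @ h                     # h^T psi_l for every l (independent of lam)
    return [Psi @ (proj / (gamma + lam)) for lam in lambdas]

rng = np.random.default_rng(0)
K = rng.random((20, 5))
H = K.T @ K / 20                         # stand-in for H^(y): symmetric positive semidefinite
h = rng.random(5)
lambdas = [0.01, 0.1, 1.0]
path = regularization_path(H, h, lambdas)
```

After the single decomposition, each additional value of the regularization parameter costs only a matrix-vector product, which is what makes a cross-validation sweep cheap.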

When $n_y$ is large, we may further reduce the computational cost and memory space
by using only a subset of kernels:

$$q(y|x; \alpha) = \sum_{\ell=1}^{b_y} \alpha_\ell^{(y)} K(x, c_\ell^{(y)}),$$

where $b_y$ is a constant chosen to be smaller than $n_y$ and $\{c_\ell^{(y)}\}_{\ell=1}^{b_y}$ is a subset of $\{x_\ell^{(y)}\}_{\ell=1}^{n_y}$.
This would be a useful heuristic when a huge number of samples are used for training.
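One simple way to pick such a subset is uniform random sampling without replacement. The text does not say how the subset is chosen, so the selection rule, the function name, and the example sizes below are assumptions of this sketch.

```python
import numpy as np

def choose_centers(X_class, b_y, rng):
    """Pick b_y kernel centers from the n_y samples of one class, without replacement."""
    n_y = X_class.shape[0]
    idx = rng.choice(n_y, size=min(b_y, n_y), replace=False)
    return X_class[idx]

rng = np.random.default_rng(0)
X_class = rng.normal(size=(50, 3))       # hypothetical samples of one class
centers = choose_centers(X_class, b_y=10, rng=rng)
```

The selected centers would then replace the full set of class samples as kernel centers, while the sums over training samples in the least-squares fit remain unchanged.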

Another  option  for  reducing  the  computation  time  when  the  number  of  samples  is  very 
large  would  be  the  stochastic  gradient  descent  method  [1].  That  is,  starting  from  some 
initial  parameter  value,  gradient  descent  is  carried  out  only  for  a  randomly  chosen  single 
sample  in  each  iteration.  Since  our  optimization  problem  is  convex,  convergence  to  the 
global  solution  is  guaranteed  (in  a  probabilistic  sense)  by  stochastic  gradient  descent. 
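A stochastic gradient version for one class can be sketched as below. It uses the fact that, up to a constant, the regularized least-squares criterion equals an average of per-sample squared errors against 0/1 targets. The function names, the decaying step-size schedule, and the toy data are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def sgd_lspc(K, targets, lam, step0, n_iter, rng):
    """SGD on (1/2n) sum_i (k_i^T a - t_i)^2 + (lam/2) ||a||^2, one sample per step.

    K[i] is the kernel vector k_i of sample i, and targets[i] is 1 if y_i is the
    class being fitted and 0 otherwise.
    """
    n, b = K.shape
    alpha = np.zeros(b)
    for t in range(n_iter):
        i = rng.integers(n)                              # one randomly chosen sample
        grad = K[i] * (K[i] @ alpha - targets[i]) + lam * alpha
        alpha -= step0 / (1.0 + t / 1000.0) * grad       # assumed decaying step size
    return alpha

def objective(alpha, K, targets, lam):
    """(1/2) a^T H a - h^T a + (lam/2) a^T a with H = K^T K / n and h = K^T t / n."""
    n = K.shape[0]
    H = K.T @ K / n
    h = K.T @ targets / n
    return 0.5 * alpha @ H @ alpha - h @ alpha + 0.5 * lam * alpha @ alpha

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 15), rng.normal(2, 1, 15)])
targets = np.repeat([1.0, 0.0], 15)                      # fit the first class
K = np.exp(-(x[:, None] - x[None, :15]) ** 2 / 2.0)      # centers: samples of that class
alpha_sgd = sgd_lspc(K, targets, lam=0.1, step0=0.02, n_iter=5000, rng=rng)
alpha_star = np.linalg.solve(K.T @ K / 30 + 0.1 * np.eye(15), K.T @ targets / 30)
```

Comparing `objective(alpha_sgd, ...)` with `objective(alpha_star, ...)` shows how close the stochastic iterate is to the analytic solution; for problems of this size the direct solve is of course preferable.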

We  focused  on  using  the  delta  kernel  for  class  labels  (see  Section  2.3).  We  expect  that 
designing  appropriate  kernel  functions  for  class  labels  would  be  useful  for  improving  the 
classification performance, e.g., in the context of multi-task learning [4, 2, 21]. We will
pursue this direction in our future work.


Abstract

A phenomenon often found in session-to-session transfers of Brain Computer
Interfaces (BCIs) is non-stationarity. It can be caused by fatigue and changing
attention level of the user, differing electrode placements, and varying impedances,
among other reasons. Covariate shift adaptation is an effective method which can adapt
to the testing sessions without the need for labeling the testing session data. The
method was applied on a BCI Competition III dataset. Results showed that covariate
shift adaptation compares favorably with methods used in the BCI competition
in coping with non-stationarities. Specifically, bagging combined with covariate shift
helped to increase stability, when applied to the competition dataset. An online
experiment also demonstrated the effectiveness of the bagged covariate shift method.
Thus, it can be summarized that covariate shift adaptation is helpful to realize adaptive
BCI systems.

Keywords

brain-computer interface, covariate shift adaptation, bagging


1  Introduction

A Brain Computer Interface (BCI) is a novel augmentative tool which allows a user
to express his or her will without muscle exertion, provided that the brain signals are
translated properly. However, it may be difficult to recognize the electroencephalography
(EEG) patterns under a fixed algorithm because of high non-stationarity of the EEG
signals. The factors causing non-stationarity include changes in user attention level, user
fatigue, and small differences in electrode position [1]. One notable manifestation of
non-stationarity is that EEG feature distributions change from one session to another, which
illustrates the non-stationary nature of the BCI signal and provides a rationale for the
design of an adaptive BCI system [2].

Moreover,  a  good  BCI  system  should  be  bi-directional  in  communication  with  the 
user.  Besides  providing  visual/auditory  feedback  to  a  user,  the  system  should  be  able  to 
adapt  to  the  user,  possibly  with  an  adaptive  translation  algorithm.  Several  studies  have 
been  conducted  on  adaptive  BCI  systems  with  positive  results.  Vidaurre  et  al.  adopted 
an  online  updated  classifier  by  adaptive  estimation  of  the  information  matrix  (ADIM)  as 
well  as  an  adaptive  LDA  with  Kalman  filtering  [3,  4].  Blumberg  et  al.  developed  Adaptive 
Linear  Discriminant  Analysis,  updating  mean  values  and  covariances  continuously  in  time 
for different motor imagery tasks [5]. However, most of the adaptive methods are based
on  supervised  learning  techniques  (e.g.,  [1,  3,  4]),  which  need  labeled  test  samples  and  are, 
thus,  costly.  Covariate  shift  adaptation  is  a  method  which  can  overcome  this  shortcoming, 
assuming  that  the  input  distributions  of  training  and  testing  sessions  are  different  while 
the  conditional  distribution  of  output  given  input  remains  unchanged  [6].  Nevertheless, 
the  plain  covariate  shift  adaptation  technique  is  rather  unstable  due  to  large  variances. 

To  cope  with  this  problem,  we  propose  a  novel  method  which  combines  covariate  shift 
adaptation  and  bagging  [7]  [8].  Through  applications  on  benchmark  data,  we  demonstrate 
the  effectiveness  of  the  proposed  approach. 


2  Methods 

In  this  section,  we  review  baseline  methods  as  well  as  our  proposed  approach. 

2.1  Feature  Extraction  by  CSP  and  the  Baseline  Classifier  LDA 

Common Spatial Patterns (CSP) is one of the most popular spatial filters for multi-channel
EEG-based BCI in recent years. In contrast to other spatial filters, CSP generates features
ready to be fed into the classifier. After band-pass filtering the EEG signals in
the frequency range of interest, high or low signal variance reflects strong or attenuated
rhythmic activity, respectively [9]. When classifying EEG into two tasks, CSP maximizes
the variance of one class while minimizing the variance of the other and, thus, reflects
the task-specific activation patterns. Some examples of CSP applications can be found in
[9, 10, 11, 12].
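
The variance-contrast idea described above can be sketched in Python as follows. This is a minimal illustration of one common CSP formulation (whitening the composite covariance, then diagonalizing one class covariance), and it is not necessarily the exact procedure used in the experiments. The function names `csp_filters` and `csp_features`, the array layout (trials x channels x samples), and the log-variance features are assumptions made for illustration.

```python
import numpy as np

def csp_filters(trials_1, trials_2, n_pairs=1):
    """Compute CSP spatial filters for two classes (illustrative sketch).

    trials_k: array of shape (n_trials, n_channels, n_samples), assumed to be
    band-pass filtered already. Returns an (n_channels, 2 * n_pairs) matrix
    whose columns are spatial filters.
    """
    def mean_cov(trials):
        # Average spatial covariance over the trials of one class.
        return np.mean([t @ t.T / t.shape[1] for t in trials], axis=0)

    S1, S2 = mean_cov(trials_1), mean_cov(trials_2)
    # Whiten the composite covariance: P^T (S1 + S2) P = I.
    evals, evecs = np.linalg.eigh(S1 + S2)
    P = evecs / np.sqrt(evals)
    # Eigenvalues d lie in [0, 1]: share of class-1 variance per direction.
    d, B = np.linalg.eigh(P.T @ S1 @ P)
    W = P @ B
    order = np.argsort(d)
    # Keep the filters with the smallest and the largest class-1 variance share.
    pick = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return W[:, pick]

def csp_features(W, trial):
    """Log-variance of the spatially filtered trial (one value per filter)."""
    z = W.T @ trial
    return np.log(np.var(z, axis=1))
```

With `n_pairs=1`, the first returned filter favors the second class (low class-1 variance) and the last one favors the first class, so the two log-variance features move in opposite directions between the classes.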

Linear Discriminant Analysis (LDA) is a popular classification method in BCI applications
[13]. LDA can be realized by a linear least-squares method if the target labels
$\{y_i\}_{i=1}^{N}$ corresponding to the feature vectors $\{x_i\}_{i=1}^{N}$ are set to
$-(N_1 + N_2)/N_1$ for class $C_1$ and to $(N_1 + N_2)/N_2$ for class $C_2$, where $N_1$
and $N_2$ are the numbers of samples of classes $C_1$ and $C_2$, respectively. More
specifically, for a linear model

$$f(x; \theta) = \theta_0 + \sum_{i=1}^{d} \theta_i x^{(i)},$$

where $x^{(i)}$ is the $i$-th element of a $d$-dimensional feature vector $x$, the
parameters $\theta$ are learned by the least-squares method:

$$\min_{\theta} \sum_{i=1}^{N} \left( y_i - f(x_i; \theta) \right)^2.$$

The least-squares solution is given as

$$\hat{\theta}_{LDA} = (X^\top X)^{-1} X^\top y,$$

where

$$X = \begin{pmatrix} 1 & x_1^\top \\ 1 & x_2^\top \\ \vdots & \vdots \\ 1 & x_N^\top \end{pmatrix},$$

$y = (y_1, y_2, \ldots, y_N)^\top$, and $X^\top$ denotes the transpose of $X$.
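
The least-squares realization of LDA above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `lda_least_squares` and `predict` are hypothetical, class labels are assumed to be coded as 1 and 2, and the normal equations are solved directly without regularization.

```python
import numpy as np

def lda_least_squares(X_feat, labels):
    """Fit LDA as least squares with class-size-coded targets.

    X_feat: (N, d) feature matrix; labels: (N,) array with values 1 or 2.
    Returns theta of length d + 1, with the bias theta_0 first.
    """
    X_feat = np.asarray(X_feat, dtype=float)
    labels = np.asarray(labels)
    N = len(labels)
    N1 = np.sum(labels == 1)
    N2 = np.sum(labels == 2)
    # Targets: -(N1 + N2)/N1 for class C1 and (N1 + N2)/N2 for class C2.
    y = np.where(labels == 1, -N / N1, N / N2)
    # Design matrix with a leading column of ones for the bias term.
    X = np.hstack([np.ones((N, 1)), X_feat])
    # Solve the normal equations (X^T X) theta = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(theta, X_feat):
    """Evaluate f(x; theta); a negative output is read as class C1."""
    X_feat = np.asarray(X_feat, dtype=float)
    return theta[0] + X_feat @ theta[1:]
```

Because the targets sum to zero, the decision threshold of this linear model sits at an output of zero.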

2.2  Covariate  Shift  Adaptation  by  IWLDA 

Covariate  shift  is  defined  as  the  situation  where  the  training  input  points  and  test  input 
points  follow  different  distributions  while  the  conditional  distribution  of  output  values 
given  input  points  is  unchanged  [6].  A  prime  example  of  covariate  shift  in  EEG-based 
BCIs occurs when, given different experimental sessions of the same imagery tasks,
event-related  synchronization/desynchronization  cortical  distributions  remain  unchanged, 
but  the  means  and  variances  shift  in  the  feature  distribution  for  each  task. 

Under  covariate  shift,  ordinary  Linear  Discriminant  Analysis  (LDA)  is  not  consistent 
[14,  6],  i.e.,  even  when  infinitely  many  training  samples  are  provided,  one  cannot  obtain  the 
optimal  solution.  To  cope  with  this  problem,  Importance  Weighted  Linear  Discriminant 
Analysis  (IWLDA)  was  proposed  [15,  6]. 

IWLDA is an extension of LDA based on the concept of importance sampling. The
importance is defined as the ratio of test and training input densities:

$$w(x) = \frac{p_{te}(x)}{p_{tr}(x)}.$$

After the introduction of the importance and a regularizer, the parameters are learned as

$$\min_{\theta} \left[ \sum_{i=1}^{N} w(x_i) \left( y_i - f(x_i; \theta) \right)^2 + \lambda \|\theta\|^2 \right],$$

where $\lambda$ ($\ge 0$) is the regularization parameter. The IWLDA solution is given by

$$\hat{\theta}_{IWLDA} = (X^\top D X + \lambda I)^{-1} X^\top D y,$$

where $D$ is the diagonal matrix with the $i$-th diagonal element $D_{i,i} = w(x_i)$ and $I$
is the identity matrix. IWLDA has been proved to be consistent even in the presence of
covariate shift.
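
The closed-form IWLDA solution can be illustrated with the following NumPy sketch. It assumes the importance values are already available as a vector `w`; the function name `iwlda_fit` and its argument layout are hypothetical. The diagonal matrix D is never formed explicitly, since multiplying the columns of the transposed design matrix by `w` gives the same product.

```python
import numpy as np

def iwlda_fit(X_feat, y, w, lam):
    """Importance-weighted, regularized least squares (illustrative sketch).

    X_feat: (N, d) features; y: (N,) targets; w: (N,) importance weights;
    lam: regularization parameter (>= 0). Returns theta with the bias first.
    """
    X_feat = np.asarray(X_feat, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    N = X_feat.shape[0]
    X = np.hstack([np.ones((N, 1)), X_feat])
    # X^T D computed by scaling column i of X^T with w_i.
    XtD = X.T * w
    A = XtD @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, XtD @ y)
```

Setting all weights to one and `lam` to zero recovers the ordinary least-squares solution of Section 2.1.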


2.3  Model  Selection  by  IWCV 


The IWLDA method contains a regularization parameter $\lambda$, which needs to be chosen
appropriately to obtain good performance. To this end, cross-validation is commonly
used, which is known to be an unbiased estimator of the generalization error. However,
ordinary cross-validation is no longer unbiased in the presence of covariate shift;
importance-weighted cross-validation (IWCV) is instead unbiased under covariate shift
[6].

More specifically, we first divide the training samples
$\{z_i \mid z_i = (x_i, y_i)\}_{i=1}^{N}$ into $k$ disjoint subsets $\{Z_r\}_{r=1}^{k}$
(we use $k = 5$ in the experiments). Then the parameter $\hat{\theta}_r$ is
obtained using $\{Z_j\}_{j \neq r}$ (i.e., without $Z_r$) by IWLDA and its mean test error
for the remaining samples $Z_r$ is computed:

$$\frac{1}{|Z_r|} \sum_{(x, y) \in Z_r} w(x) \, \mathrm{loss}\left( f(x; \hat{\theta}_r), y \right),$$

where

$$\mathrm{loss}(\hat{y}, y) = \begin{cases} \frac{1}{2} \left( 1 - \mathrm{sign}(\hat{y} y) \right) & \text{(Classification)} \\ (\hat{y} - y)^2 & \text{(Regression)}. \end{cases}$$

We repeat this procedure for $r = 1, 2, \ldots, k$ and choose the regularization parameter
$\lambda$ so that the average of the above mean test error over all $r$ is minimized.
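
The IWCV procedure above can be sketched as follows. This is a minimal illustration, assuming that the importance weights `w` are given, that `X` already contains the leading column of ones, and that the classification loss is used. The names `iwlda_fit`, `iwcv_error`, and `select_lambda`, as well as the random fold assignment, are assumptions made for this example.

```python
import numpy as np

def iwlda_fit(X, y, w, lam):
    """Importance-weighted regularized least squares on a design matrix X."""
    XtD = X.T * w
    return np.linalg.solve(XtD @ X + lam * np.eye(X.shape[1]), XtD @ y)

def iwcv_error(X, y, w, lam, k=5, seed=0):
    """Average importance-weighted held-out classification error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for r in range(k):
        held = folds[r]
        train = np.concatenate([folds[j] for j in range(k) if j != r])
        theta = iwlda_fit(X[train], y[train], w[train], lam)
        y_hat = X[held] @ theta
        # 0/1 loss written as (1 - sign(y_hat * y)) / 2.
        loss = 0.5 * (1.0 - np.sign(y_hat * y[held]))
        errors.append(np.mean(w[held] * loss))
    return float(np.mean(errors))

def select_lambda(X, y, w, candidates, k=5, seed=0):
    """Return the candidate lambda with the smallest IWCV error."""
    scores = [iwcv_error(X, y, w, lam, k, seed) for lam in candidates]
    return candidates[int(np.argmin(scores))]
```

Using the same `seed` for every candidate keeps the fold split fixed, so the candidates are compared on identical partitions.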


2.4  Direct  Importance  Estimation  by  KLIEP  or  uLSIF 

For computing the IWLDA solution and performing model selection by IWCV, the values
of the importance are required, which are usually unknown. A naive approach to importance
estimation would be to first estimate the training and testing densities separately
from the training input samples $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ and the testing input samples
$\{x_j^{te}\}_{j=1}^{n_{te}}$, and then estimate the importance by taking the ratio of the
estimated densities. However, density estimation is known to be a difficult problem,
particularly in high-dimensional cases. Therefore, this naive approach may not be effective;
directly estimating the importance without estimating the densities would be more
promising [15].

2.4.1 KLIEP

KLIEP (Kullback-Leibler Importance Estimation Procedure) is a method to estimate the
importance directly. First, the importance is modeled as

$$\hat{w}(x) = \sum_{l=1}^{b} \alpha_l \exp\left( -\frac{\|x - c_l\|^2}{2\sigma^2} \right), \qquad (1)$$

where $\{\alpha_l\}_{l=1}^{b}$ are coefficients to be learned ($\alpha_l \ge 0$ for
$l = 1, 2, \ldots, b$), $\{c_l\}_{l=1}^{b}$ are chosen randomly from
$\{x_j^{te}\}_{j=1}^{n_{te}}$, and the number of parameters is set to
$b = \min(100, n_{te})$ in the experiments. The kernel width $\sigma$ can be optimized by
cross-validation (see [15]).

Using the above importance model, we can obtain an estimate of the test input density
as

$$\hat{p}_{te}(x) = \hat{w}(x) p_{tr}(x).$$

Based on this expression, $\{\alpha_l\}_{l=1}^{b}$ are determined so that the
Kullback-Leibler divergence from $p_{te}(x)$ to $\hat{p}_{te}(x)$ is minimized:

$$KL[p_{te}(x) \| \hat{p}_{te}(x)] = \int_{\mathcal{D}} p_{te}(x) \log \frac{p_{te}(x)}{\hat{w}(x) p_{tr}(x)} dx = \int_{\mathcal{D}} p_{te}(x) \log \frac{p_{te}(x)}{p_{tr}(x)} dx - \int_{\mathcal{D}} p_{te}(x) \log \hat{w}(x) dx.$$

Based on this, the optimization criterion of KLIEP is given as follows (see [15] for details):

$$\max_{\{\alpha_l\}_{l=1}^{b}} \sum_{j=1}^{n_{te}} \log \left( \sum_{l=1}^{b} \alpha_l \exp\left( -\frac{\|x_j^{te} - c_l\|^2}{2\sigma^2} \right) \right)$$

subject to

$$\sum_{i=1}^{n_{tr}} \sum_{l=1}^{b} \alpha_l \exp\left( -\frac{\|x_i^{tr} - c_l\|^2}{2\sigma^2} \right) = n_{tr} \quad \text{and} \quad \alpha_1, \alpha_2, \ldots, \alpha_b \ge 0.$$
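
One simple way to solve the KLIEP criterion numerically is projected gradient ascent, sketched below. This is an assumption-laden illustration rather than the procedure of [15]: the fixed step size, the iteration count, and the names `gaussian_kernel` and `kliep` are all hypothetical choices, and the kernel width `sigma` is taken as given instead of being chosen by cross-validation.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Return K[i, l] = exp(-||X[i] - C[l]||^2 / (2 sigma^2))."""
    sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kliep(x_tr, x_te, sigma, b=100, n_iter=500, step=1e-4, seed=0):
    """Estimate the importance w(x) = p_te(x) / p_tr(x) (illustrative sketch).

    x_tr: (n_tr, d) training inputs; x_te: (n_te, d) test inputs.
    Returns a function that evaluates the estimated importance.
    """
    rng = np.random.default_rng(seed)
    b = min(b, len(x_te))
    centers = x_te[rng.choice(len(x_te), size=b, replace=False)]
    A = gaussian_kernel(x_te, centers, sigma)                  # (n_te, b)
    bvec = gaussian_kernel(x_tr, centers, sigma).mean(axis=0)  # (b,)
    alpha = np.ones(b)
    alpha = alpha / (bvec @ alpha)          # start from a feasible point
    for _ in range(n_iter):
        # Gradient of sum_j log(A alpha)_j with respect to alpha.
        alpha = alpha + step * (A.T @ (1.0 / (A @ alpha)))
        # Move back towards the equality constraint bvec^T alpha = 1.
        alpha = alpha + (1.0 - bvec @ alpha) * bvec / (bvec @ bvec)
        # Enforce non-negativity, then renormalize.
        alpha = np.maximum(alpha, 0.0)
        alpha = alpha / (bvec @ alpha)

    def w_hat(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return w_hat
```

The equality constraint is enforced in the averaged form, i.e. the estimated importance has mean one over the training inputs, which is the constraint above divided by the number of training samples.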


2.4.2  uLSIF 

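uLSIF (unconstrained Least-Squares Importance Fitting) [16] also estimates the importance
directly. The modeling of the importance is the same as Equation (1), but the
parameters are determined by minimizing the squared error:

$$J_0(\alpha) = \frac{1}{2} \int \left( \hat{w}(x) - \frac{p_{te}(x)}{p_{tr}(x)} \right)^2 p_{tr}(x) dx = \frac{1}{2} \int \hat{w}(x)^2 p_{tr}(x) dx - \int \hat{w}(x) p_{te}(x) dx + C,$$

where $C = \frac{1}{2} \int w(x) p_{te}(x) dx$ is a constant and thus can be ignored. As
presented in detail in [16], the solution of uLSIF is given by

$$\hat{\alpha} = \max\left( 0_b, (\hat{H} + \lambda I_b)^{-1} \hat{h} \right),$$

where the maximum is taken element-wise, $0_b$ is the $b$-dimensional zero vector,
$I_b$ is the $b$-dimensional identity matrix, and

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left( -\frac{\|x_i^{tr} - c_l\|^2 + \|x_i^{tr} - c_{l'}\|^2}{2\sigma^2} \right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left( -\frac{\|x_j^{te} - c_l\|^2}{2\sigma^2} \right).$$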

2.5  Bagged  IWLDA 

IWLDA  combined  with  KLIEP  or  uLSIF  is  shown  to  perform  well  under  covariate  shift. 
However,  a  weakness  of  this  approach  is  that  IWLDA  can  still  produce  a  large-variance 
estimator,  causing  instability. 

To ease this problem, we propose Bagged Importance Weighted Linear Discriminant
Analysis (BIWLDA), which combines bagging (short for "bootstrap aggregating") [7] [18]
and IWLDA (with $\lambda$ chosen by IWCV) to improve the stability of classifiers.

Bagging  is  the  parallel  approach  to  ensemble  construction,  which  combines  indepen¬ 
dently  constructed  accurate  and  diverse  base  learners  [17].  The  idea  behind  bagging  is 
that  averaging  the  predictions  will  lead  to  the  improvement  of  classification  accuracy, 
particularly  variance  reduction.  Since  plain  covariate-shift  adaptation  methods  tend  to 
produce  high-variance  estimators,  combining  them  with  bagging  would  be  promising. 

More  specifically,  the  proposed  BIWLDA  procedure  is  summarized  as  follows: 

1. Randomly take M trials out of the whole N-sized training set, with M = 0.8N;

2. Train IWLDA (with $\lambda$ chosen by IWCV) on the re-sampled training set;

3. Repeat steps 1 and 2 30 times;

4. Average the 30 predictors.
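
The four steps above can be sketched as follows. Several simplifications are assumed here and should be read as such: the importance weights `w` are estimated once beforehand, $\lambda$ is passed in as a fixed value instead of being re-selected by IWCV in every round, and the M trials are drawn without replacement (the text does not say which sampling scheme is used). The names `iwlda_fit` and `biwlda_fit` are hypothetical.

```python
import numpy as np

def iwlda_fit(X, y, w, lam):
    """Importance-weighted regularized least squares on a design matrix X."""
    XtD = X.T * w
    return np.linalg.solve(XtD @ X + lam * np.eye(X.shape[1]), XtD @ y)

def biwlda_fit(X, y, w, lam, n_bags=30, frac=0.8, seed=0):
    """Average IWLDA parameter vectors over random subsets of the trials.

    X: (N, d + 1) design matrix with a leading column of ones;
    y: (N,) targets; w: (N,) importance weights.
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    M = int(round(frac * N))
    thetas = []
    for _ in range(n_bags):
        # Step 1: draw M of the N trials (without replacement, by assumption).
        idx = rng.choice(N, size=M, replace=False)
        # Step 2: train IWLDA on the re-sampled set.
        thetas.append(iwlda_fit(X[idx], y[idx], w[idx], lam))
    # Step 4: for linear models, averaging the predictors is the same as
    # averaging their parameter vectors.
    return np.mean(thetas, axis=0)
```

Since every base learner is linear in its parameters, the averaged predictor is itself a single linear classifier and costs no more to apply online than plain LDA.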

The classifiers realized with KLIEP with and without bagging are named BIWLDA1
and IWLDA1, respectively, while the classifiers realized with uLSIF with and without
bagging are named BIWLDA2 and IWLDA2, respectively.


3  Experiments  on  BCI  Competition  III  Dataset  IVc 

In  this  section,  we  show  the  experimental  results  on  BCI  Competition  III  Dataset  IVc. 

3.1  Dataset 

Dataset IVc [19] in BCI Competition III was recorded from one healthy subject. Visual
cues of 3.5 seconds indicated which of the following 3 motor imageries the subject should
perform: left hand, right foot and tongue. For training, 210 trials were provided with
labels of left hand or right foot. 420 test trials were recorded 4 hours after
the training sessions. The testing sessions were similar to the training sessions, but the
motor imagery had to be performed for 1 second only, compared to the 3.5 seconds in
the training sessions. The other difference was that the class tongue was replaced by the
class relax.

Figure 1: Different feature distributions between training and testing sessions in BCI
Competition III Dataset IVc. (Legend: training left hand, training right foot, testing
left hand, testing right foot.)

118  EEG  channels  were  measured  at  positions  of  the  extended  international  10-20 
system.  Signals  were  band-pass  filtered  from  0.05  to  200  Hz  and  then  digitized  at  1000 
Hz  with  16  bit  accuracy.  The  data  downsampled  to  100  Hz  was  used  for  analysis. 

Since  left  hand  and  right  foot  imagery  tasks  were  both  included  in  the  training  and 
testing  sessions,  and  these  two  sessions  had  a  long  time  interval  in  between,  checking 
these  two  classes  would  reveal  whether  there  is  a  different  feature  distribution  between 
two  sessions. 

3.2 Investigation of Feature Distributions and Improved Algorithms

It  has  been  shown  in  many  previous  studies  that  filtering  must  precede  CSP  in  order  to 
make  CSP  optimal  for  the  separation  of  two  classes.  After  plotting  and  observing  the 
power  spectrum,  we  decided  to  apply  only  bandpass-filtering,  from  12  to  14  Hz,  though 
the  competition  winner  considered  a  broader  bandpass-filtering  [18].  Also  the  competition 
winner  claimed  the  optimal  dimension  of  CSP  was  three,  and  this  result  was  verified  by 
us. 

By plotting the features extracted by CSP for left hand and right foot imagery
(see Figure 1), it can be seen that the feature distribution did change between sessions
and that there was a need to shift the classification boundary. Note that, for ease of
visualization, only two dimensions of the calculated features, which correspond to the two
most important CSP filters from the training set, were drawn. Figure 2 shows the full
two-task classification process of the first winner as well as our algorithm. The main
difference lies in the replacement of LDA by IWLDA or BIWLDA.

Figure 2: (a) Flowchart of the first winner; (b) Flowchart of the covariate shift methods.


Table 1: Testing results of LDA, IWLDA, BLDA and BIWLDA.

Method       LDA     IWLDA1  IWLDA2  BLDA    BIWLDA1  BIWLDA2
MSE (mean)   0.246   0.1726  0.1183  0.519   0.0994   0.1165
MSE (std)    0       0.0563  0.0014  0.6023  0.0142   0.0346

(a) MSEs by LDA (baseline), IWLDA, BLDA and BIWLDA.

Method           LDA       IWLDA1  IWLDA2  BLDA    BIWLDA1  BIWLDA2
Accuracy (mean)  0.8429    0.9336  0.9539  0.8115  0.965    0.9546
Accuracy (std)   1.17E-16  0.035   0.0011  0.1322  0.0098   0.0343

(b) Accuracy by LDA (baseline), IWLDA, BLDA and BIWLDA.

3.3  Results 

Table 1 shows the testing results of all methods with the same data preprocessing. The
means and standard deviations were based on ten iterations of testing. From Table 1(a),
it can be seen that the covariate shift adaptation methods worked very well. Among them,
BIWLDA1 proved to be much more stable than IWLDA1, while IWLDA2 and BIWLDA2
were comparable to each other. However, in real applications, normalization of the outputs
is impossible. Furthermore, as an additional evaluation criterion, classification accuracy
was also calculated, as shown in Table 1(b).

It may not be appropriate for us to claim that our methods worked better than the
method of the competition winner, since this dataset contained three classes, and our
methods only worked better in separating two of them. However, we think our method solved
the non-stationarity problem caused by session-to-session transfer more efficiently, as can
be seen from the accuracy listed in Table 1(b).

When estimating the importance, the parameter $b$ was set to $\min(100, n_{te})$ (see
Section 2.4.1), where $n_{te}$ is the number of testing trials. 100 trials were randomly
chosen from the testing set in cases where $n_{te}$ was greater than 100. To determine the
effects brought on by $n_{te}$, we tested with the first 40, 80, ..., 240 and all 280 trials
(with the training set unchanged), which may be seen as a pseudo-online importance
estimation scenario. The testing was repeated 10 times with the four covariate shift
methods, and the averaged accuracy was plotted in Figure 3. It can be concluded that
BIWLDA1 is the most stable method for different numbers of testing trials taken for
importance estimation.

Figure 3: The first 40, 80, ..., 280 trials were taken into importance estimation, with
IWLDA1, IWLDA2, BIWLDA1 and BIWLDA2.

3.4  Online  application  of  bagged  covariate  shift  method 

From its application on the benchmark dataset, it is not difficult to see that BIWLDA1
performed well in terms of both accuracy and stability. Moreover, we wished to test
its effectiveness in real online applications and, thus, performed an online experiment
on three healthy female subjects (age 38, 23, 30). For the experiment, we used a G.tec
USBamp system controlled with the software BCI2000 [20]. EEG was recorded using 10
or 15 electrodes positioned at locations FC3, FC1, FCz, FC2, FC4, C3, C1, Cz, C2, C4,
CP3, CP1, CPz, CP2, and CP4 of the international 10-20 system. For subject 3, the former
10 channels were installed. Data was sampled at 256 Hz and the feedback was updated
every 1 second. The pre-feedback period of each trial was set to 2 seconds and the
feedback period to 3 seconds.

For  the  online  experiment,  a  ball  was  displayed  traveling  at  a  constant  speed  from  the 
left  to  the  right  of  a  screen.  Vertical  position  (distance  from  the  midline  of  the  screen)  of 
the  ball  served  as  feedback,  changing  according  to  the  classification  output  of  the  previous 
second.  Subjects  were  asked  to  imagine  moving  their  left  hand  or  both  hands  and  both 
feet  to  direct  the  ball  downwards  and  upwards,  respectively,  and  position  it  to  hit  a  target 
bar  at  the  right  of  the  screen. 

The experiment was carried out in two parts, separated by one or two days. In the
first part, the subjects were trained to gain familiarity with both offline and online
experiments, obtaining trial accuracies above 80%. The algorithm used the two most
discriminative features extracted by CSP, classified with LDA. In the second part, only
online experiments were conducted, and the subjects were given a few minutes to practice
before the experiment started. After the first session, BIWLDA1 was run to adjust the LDA
coefficients according to the session transfer from the best-performing online session
recorded on the previous day. After running BIWLDA1, another two sessions of experiments
were conducted. Each online session consisted of 43 trials, and when running BIWLDA1, three
non-overlapping 1-second windows (namely seconds 2, 3, and 4 of a trial) were cut from each
trial, which means 129 one-second samples were obtained.

Table 2: Mutual information estimated before and after BIWLDA1 updated

             Session before BIWLDA1     Sessions after BIWLDA1
             MI(old)    MI(updated)     MI(old)    MI(updated)
subject 2    0.3052     0.3527          0.3452     0.3660
subject 3    0.1005     0.1679          0.1511     0.1789

The trial accuracy improved from 72.09% to 80.23% (subject 2), and from 66%
to 78% (subject 3) after adjustment of the LDA coefficients. Results of only two subjects
are presented here because subject 1 reached the same trial accuracy as the previous day at
83%, showing no sign of non-stationarities. In order to verify that these improvements
were not due to the learning process itself, we applied the coefficients before and after
BIWLDA1 adjustment to all the online sessions, analyzing the mutual information [21]
between the targets and the outputs of online data (still 129 samples per session). Using
mutual information as the evaluation criterion is natural because it takes into account not
only the sign of the output but also the amplitude, which, in turn, is used to set the
distance between the ball and the vertical midline of the screen.

From Table 2, it can be concluded that these improvements cannot be attributed to
the learning process because, in sessions both before and after the BIWLDA1 adjustment,
the updated coefficients generated higher mutual information, with values that are
quite similar. Note in Table 2 that because the session before BIWLDA1 used old
coefficients, the numbers of MI(old) are written in bold, as is MI(updated) in the sessions
after BIWLDA1 adjustment.

Figure 4 gives a more direct description of the session-to-session transfer of the feature
distribution for subject 2. In it, the training tasks are the tasks from the
best-performing session on the previous day; the testing tasks are those performed in
the first online session on the following day, which were classified more accurately after
an adjustment. Although the session-to-session transfer phenomenon was not particularly
obvious for subject 3, as Figure 5(a) shows, an adjustment resulted in an increase of
accuracy, as shown in Figure 5(b). Adjusting the classifier may also help the subject reach
a more controllable state, because the online experiment always involves intricate
interaction between feedback and the subject.

Figure 4: Session-to-session transfer phenomenon in subject 2 and classification boundary
updating

4  Discussions  and  Conclusions 

In order to test the effectiveness of covariate shift adaptation schemes on a BCI, six
classifiers, namely LDA, IWLDA1, IWLDA2, BLDA, BIWLDA1 and BIWLDA2,
were applied to Dataset IVc of the BCI Competition III. From the results, we arrive at
the following conclusions.

Lacking  detailed  descriptions  regarding  the  experimental  protocol  of  the  two  sessions  in 
this  dataset,  we  theorize  that  the  non-stationarity  originated  from  three  aspects.  First,  if 
electrodes  were  reinstalled  between  sessions,  slight  differences  in  electrode  placement  may 
have  caused  shifts  of  the  data  in  the  feature  space.  If  no  reinstallation  of  electrodes  was 
performed,  it  is  also  possible  that  the  electrode  gel  dried  after  four  hours,  causing  varying 
impedances.  Second,  the  long  breaks  between  runs  may  also  have  affected  performance. 
Although  in  this  dataset  a  good  performance  level  was  maintained,  this  is  a  normal 
occurrence  in  BCI  experiments.  An  example  was  given  in  [22],  where  one  of  the  breaks 
coincided  with  the  end  of  a  phase  with  good  performance.  Therefore,  it  is  possible  that, 
upon  resuming  the  experiment,  the  subject  was  unable  to  regain  the  control  acquired 
in  the  previous  phase.  Third,  it  may  have  been  difficult  for  the  subjects  to  maintain  an 
adequate  attention  level  due  to  fatigue  or  the  learning  process  itself.  Shenoy  et  al.  [22]  also 
pointed  out  in  their  study  that  the  non-stationarity  was  due  to  different  background  EEG 
activities  brought  on  by  the  introduction  of  visual  feedback  during  the  online  feedback 
session.  In  our  case,  however,  this  cannot  be  considered  to  be  a  reason  since  the  experiment 
setup  remained  unchanged  between  sessions. 

In  [22],  two  possible  ways  of  adaptation  were  also  discussed,  namely  shifting  and 
rotating  the  boundary.  The  results  of  our  study  demonstrated  that  when  there  is  a  need 
for  shifting  or  rotation,  covariate  shift  methods  are  effective  in  adaptation. 

Overall, covariate shift adaptation was shown to be effective for improving the
classification accuracy when the feature distributions differ from one session to another.
Especially when combined with bagging, even a small number of testing trials will result in
an accurate importance estimation.

Figure 5: (a) Session-to-session transfer phenomenon in subject 3 (feature distribution of
both sessions), and (b) for a clearer view, the 2nd session (first session on the following
day) plotted separately, with classification boundary updating. (Legend: training task 1,
training task 2, testing task 1, testing task 2, old boundary, updated boundary.)

It would be promising to integrate the proposed algorithm into a BCI system, where
adaptation would be run at the beginning of every session. For this purpose, we designed
an online experiment and demonstrated the effectiveness of BIWLDA1.

LDA  and  quadratic  discriminant  analysis  (QDA)  are  popular  classification  techniques, 
especially  when  adaptation  is  involved,  due  to  their  effectiveness  and  simplicity.  Examples 
of  adapted  LDA/QDA  applications  can  be  found  in  [3,  4,  5]. 

Note that most of the existing adaptation studies focused on trial-to-trial adaptation
[3, 4, 5], while we investigated session-to-session adaptation. For subjects who have little
experience with online experiments and may easily become frustrated with incorrect
feedback results, the bagged covariate shift method is helpful in reinforcing their
confidence by making slight adjustments to the settings of the previous day and, thus,
avoiding the difficulties of offline training each time before an online experiment.


Abstract

Appropriately  designing  sampling  policies  is  highly  important  for  obtaining  better 
control  policies  in  reinforcement  learning.  In  this  paper,  we  first  show  that  the 
least-squares  policy  iteration  (LSPI)  framework  allows  us  to  employ  statistical  ac¬ 
tive  learning  methods  for  linear  regression.  Then  we  propose  a  design  method  of 
good  sampling  policies  for  efficient  exploration,  which  is  particularly  useful  when 
the  sampling  cost  of  immediate  rewards  is  high.  The  effectiveness  of  the  proposed 
method,  which  we  call  active  policy  iteration  (API),  is  demonstrated  through  sim¬ 
ulations  with  a  batting  robot. 


Keywords 

reinforcement  learning,  Markov  decision  process,  least-squares  policy  iteration,  ac¬ 

1 Introduction

Reinforcement  learning  (RL)  is  the  problem  of  letting  an  agent  learn  intelligent  behavior 
through  trial-and-error  interaction  with  unknown  environment  (Sutton  &  Barto,  1998). 
More  specifically,  the  agent  learns  its  control  policy  so  that  the  amount  of  rewards  it  will 
receive  in  the  future  is  maximized.  Due  to  its  potential  possibilities,  RL  has  attracted  a 
great  deal  of  attention  recently  in  the  machine  learning  community. 

In practical RL tasks, it is often expensive to obtain immediate reward samples while
state-action trajectory samples are readily available. For example, let us consider a
robot-arm control task of hitting a ball with a bat and driving the ball as far away as
possible (see Figure 9). Let us adopt the carry of the ball as the immediate reward. In this
setting, obtaining state-action trajectory samples of the robot arm is easy and relatively
cheap since we just need to control the robot arm and record its state-action trajectories
over time. On the other hand, explicitly computing the carry of the ball from the
state-action samples is hard due to friction and elasticity of links, air resistance,
unpredictable disturbances such as a current of air, and so on. Thus, in practice, we may
have to put the robot in open space, let the robot really hit the ball, and measure the
carry of the ball manually. Gathering immediate reward samples is therefore much more
expensive than gathering state-action trajectory samples.

When  the  sampling  cost  of  immediate  rewards  is  high,  it  is  important  to  design  the 
sampling  policy  appropriately  so  that  a  good  control  policy  can  be  obtained  from  a  small 
number  of  samples.  So  far,  the  problem  of  designing  good  sampling  policies  has  been  ad¬ 
dressed  in  terms  of  the  trade-off  between  exploration  and  exploitation  (Kaelbling,  Littman, 
&  Moore,  1996).  That  is,  an  RL  agent  is  required  to  determine  either  to  explore  new  states 
for  learning  more  about  unknown  environment  or  to  exploit  previously  acquired  knowledge 
for  obtaining  more  rewards. 

A simple framework for controlling the exploration-exploitation trade-off is the
$\epsilon$-greedy policy (Sutton & Barto, 1998): with (small) probability $\epsilon$, the
agent chooses to explore the unknown environment randomly; otherwise it follows the current
control policy for exploitation. The choice of the parameter $\epsilon$ is critical in the
$\epsilon$-greedy policy. A standard and natural idea would be to decrease the probability
$\epsilon$ as the learning process progresses, i.e., the environment is actively explored in
the beginning and then the agent tends to be in the exploitation mode later. However,
theoretically and practically sound methods for determining the value of $\epsilon$ seem to
be still open research topics. Also, when the agent decides to explore the unknown
environment, merely choosing the next action randomly would be far from the best possible
option.
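
For concreteness, the epsilon-greedy rule with a decaying exploration probability can be sketched as below. The function names `epsilon_greedy` and `decayed_epsilon`, the list of action values `q_values`, and the particular decay schedule are hypothetical; as noted above, how the decay should be set is an open question, so the schedule here is only one arbitrary possibility.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit.

    q_values: list of estimated action values for the current state.
    rng: source of randomness exposing random() and randrange().
    """
    if rng.random() < epsilon:
        # Exploration: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploitation: choose the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, eps0=1.0, decay=0.01):
    """One arbitrary schedule that shrinks epsilon as the step count t grows."""
    return eps0 / (1.0 + decay * t)
```

With `epsilon` equal to zero the rule is purely greedy, and with `epsilon` equal to one it ignores the value estimates entirely, which illustrates why uniformly random exploration can be wasteful.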

An alternative strategy called Explicit Explore or Exploit ($E^3$) was proposed in Kearns
& Singh (1998) and Kearns & Singh (2002). The basic idea of $E^3$ is to control the
balance between exploration and exploitation so that the accuracy of environment model
estimation is optimally improved. More specifically, when the number of known states
is small, the agent actively explores unvisited (or less visited) states; as the number of
known states increases, exploitation tends to be prioritized. The $E^3$ strategy is
efficiently realized by an algorithm called R-max (Brafman & Tennenholtz, 2002; Strehl,
Diuk, & Littman, 2007). R-max assigns the maximum 'value' to unknown states so that the
unknown states are visited with high probability. An advantage of $E^3$ and R-max is that
polynomial-time convergence (with respect to the number of states) to a near-optimal
policy is theoretically guaranteed. However, since the algorithms explicitly count the
number of visits at every state, it is not straightforward to extend the idea to continuous
state spaces (Li, Littman, & Mansley, 2008). This is a critical limitation in robotics
applications since state spaces are usually spanned by continuous variables such as joint
angles and angular velocities.

In this paper, we address the problem of designing sampling policies from a different
point of view: active learning (AL) for value function approximation. We adopt the
framework of least-squares policy iteration (LSPI) (Lagoudakis & Parr, 2003) and show
that statistical AL methods for linear regression (Fedorov, 1972; Cohn, Ghahramani, &
Jordan, 1996; Wiens, 2000; Kanamori & Shimodaira, 2003; Sugiyama, 2006; Sugiyama
& Nakajima, 2009) can be naturally employed. In the LSPI framework, the state-action
value function is approximated by fitting a linear model with least-squares estimation. A
traditional AL scheme (Fedorov, 1972; Cohn et al., 1996) is designed to find the input
distribution such that the variance of the least-squares estimator is minimized. To justify
the use of the traditional AL scheme, the bias must be guaranteed not to increase
when the variance is reduced, since the expectation of the squared approximation error
of the value function is expressed as the sum of the squared bias and variance. To this
end, we need to assume a strong condition that the linear model used for value function
approximation is correctly specified, i.e., if the parameters are learned optimally, the true
value function can be perfectly approximated.

However, such a correct model assumption may not be fulfilled in practical RL tasks
since the profile of value functions may be highly complicated. To cope with this problem,
a two-stage AL scheme has been proposed in Kanamori & Shimodaira (2003). The use of
the two-stage AL scheme can be theoretically justified even when the model is misspecified,
i.e., the true function is not included in the model. The key idea of this two-stage AL
scheme is to use dummy samples gathered in the first stage for estimating the approximation
error of the value function; then additional samples are chosen based on AL in the
second stage. This two-stage scheme works well when a large number of dummy samples
are used for estimating the approximation error in the first stage. However, due to high
sampling costs in practical RL problems, the practical performance of the two-stage AL
method in RL scenarios would be limited.

To  overcome  the  weakness  of  the  two-stage  AL  method,  single-shot  AL  methods  have 
been  developed  (Wiens,  2000;  Sugiyama,  2006).  The  use  of  the  single-shot  AL  methods 
can  be  theoretically  justified  when  the  model  is  approximately  correct.  Since  dummy 
samples  are  not  necessary  in  the  single-shot  AL  methods,  good  performance  is  expected 
even  when  the  number  of  samples  to  be  collected  is  not  large.  Moreover,  the  algorithms 
of  the  single-shot  methods  are  very  simple  and  computationally  efficient.  For  this  reason, 
we  adopt  the  single-shot  AL  method  proposed  in  Sugiyama  (2006),  and  develop  a  new 
exploration scheme for the LSPI-based RL algorithm. The usefulness of the proposed
approach, which we call active policy iteration (API), is demonstrated through
batting-robot simulations.

The rest of this paper is organized as follows. In Section 2, we formulate the RL problem
using Markov decision processes and review the LSPI framework. Then in Section 3,
we show how a statistical AL method could be employed for optimizing the sampling
policy in the context of value function approximation. In Section 4, we apply our AL strategy
to the LSPI framework and show the entire procedure of the proposed API algorithm.
In Section 5, we demonstrate the effectiveness of API through ball-batting simulations.
Finally, in Section 6, we conclude by summarizing our contributions and describing future
work.

2  Formulation  of  Reinforcement  Learning  Problem 

In  this  section,  we  formulate  the  RL  problem  as  a  Markov  decision  problem  (MDP) 
following  Sutton  &  Barto  (1998),  and  review  how  it  can  be  solved  using  a  method  of 
policy  iteration  following  Lagoudakis  &  Parr  (2003). 

2.1  Markov  Decision  Problem 

Let us consider an MDP specified by

$$(\mathcal{S}, \mathcal{A}, P_{\mathrm{T}}, R, \gamma), \tag{1}$$

where

• $\mathcal{S}$ is a set of states,

• $\mathcal{A}$ is a set of actions,

• $P_{\mathrm{T}}(s'|s,a)$ ($\in [0,1]$) is the conditional probability density of the agent's transition
from state $s$ to next state $s'$ when action $a$ is taken,

• $R(s,a,s')$ ($\in \mathbb{R}$) is a reward for transition from $s$ to $s'$ by taking action $a$,

• $\gamma$ ($\in (0,1]$) is the discount factor for future rewards.

Let $\pi(a|s)$ ($\in [0,1]$) be a stochastic policy, which is a conditional probability density of
taking action $a$ given state $s$. The state-action value function $Q^{\pi}(s,a)$ ($\in \mathbb{R}$) for policy
$\pi$ denotes the expectation of the discounted sum of rewards the agent will receive when
taking action $a$ in state $s$ and following policy $\pi$ thereafter, i.e.,

$$Q^{\pi}(s,a) = \mathop{\mathbb{E}}_{\{s_n,a_n\}_{n=2}^{\infty}} \left[ \sum_{n=1}^{\infty} \gamma^{n-1} R(s_n,a_n,s_{n+1}) \,\middle|\, s_1 = s,\ a_1 = a \right], \tag{2}$$

where $\mathbb{E}_{\{s_n,a_n\}_{n=2}^{\infty}}$ denotes the expectation over trajectory $\{s_n,a_n\}_{n=2}^{\infty}$ following
$P_{\mathrm{T}}(s_{n+1}|s_n,a_n)$ and $\pi(a_n|s_n)$.

The goal of RL is to obtain the policy such that the expectation of the discounted
sum of future rewards is maximized. The optimal policy can be expressed as

$$\pi^{*}(a|s) = \delta\big(a - \mathop{\mathrm{argmax}}_{a'} Q^{*}(s,a')\big), \tag{3}$$

where $\delta(\cdot)$ is Dirac's delta function and

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \tag{4}$$

is the optimal state-action value function.

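$Q^{\pi}(s,a)$ can be expressed by the following recurrent form called the Bellman equation:

$$Q^{\pi}(s,a) = R(s,a) + \gamma \mathop{\mathbb{E}}_{P_{\mathrm{T}}(s'|s,a)} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ Q^{\pi}(s',a') \right], \quad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \tag{5}$$

where

$$R(s,a) = \mathop{\mathbb{E}}_{P_{\mathrm{T}}(s'|s,a)} \left[ R(s,a,s') \right] \tag{6}$$

is the expected reward when the agent takes action $a$ in state $s$. $\mathbb{E}_{P_{\mathrm{T}}(s'|s,a)}$ denotes the
conditional expectation of $s'$ over $P_{\mathrm{T}}(s'|s,a)$ given $s$ and $a$, and $\mathbb{E}_{\pi(a'|s')}$ denotes the
conditional expectation of $a'$ over $\pi(a'|s')$ given $s'$.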
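For a small discrete MDP with a known model, the Bellman equation (5) is a linear system in the values $Q^{\pi}(s,a)$ and can be solved directly. The sketch below illustrates this under the assumptions that states and actions are finite, that $\gamma < 1$, and that the model is stored in arrays `P[s, a, s']`, `R[s, a, s']`, and `pi[s, a]`; these array layouts and the function name are hypothetical conveniences.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma):
    """Solve Eq. (5) exactly for a finite MDP.

    P[s, a, s2]  : transition probabilities
    R[s, a, s2]  : rewards
    pi[s, a]     : policy probabilities
    Returns Q[s, a].
    """
    n_states, n_actions, _ = P.shape
    # Expected immediate reward R(s, a), Eq. (6)
    r_sa = np.einsum("ijk,ijk->ij", P, R)
    # Probability of moving from pair (s, a) to pair (s2, a2)
    P_pi = np.einsum("ijk,kl->ijkl", P, pi).reshape(n_states * n_actions, -1)
    # Q = r + gamma * P_pi Q  <=>  (I - gamma * P_pi) Q = r
    q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_pi, r_sa.ravel())
    return q.reshape(n_states, n_actions)


# Two states; action 0 leads to state 0, action 1 leads to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                      # reward 1 for arriving in state 1
pi = np.array([[0.0, 1.0], [0.0, 1.0]])  # always take action 1
Q = evaluate_policy(P, R, pi, gamma=0.9)
```

When the state or action space is large or continuous, this direct solve is no longer feasible, which is what motivates the function approximation introduced in Section 2.3.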

2.2  Policy  Iteration 

Computing the value function $Q^{\pi}(s,a)$ is called policy evaluation. Using $Q^{\pi}(s,a)$, we may
find a better policy $\pi'(a|s)$ by 'softmax' update:

$$\pi'(a|s) \propto \exp\big(Q^{\pi}(s,a)/\beta\big), \tag{7}$$

where $\beta$ ($> 0$) determines the randomness of the new policy $\pi'$; or by ε-greedy update:

$$\pi'(a|s) = \epsilon\, p_{\mathrm{u}}(a) + (1 - \epsilon)\, \delta\big(a - \mathop{\mathrm{argmax}}_{a'} Q^{\pi}(s,a')\big), \tag{8}$$

where $p_{\mathrm{u}}(a)$ denotes the uniform probability density over actions and $\epsilon$ ($\in (0,1]$)
determines how stochastic the new policy $\pi'$ is. Updating $\pi$ based on $Q^{\pi}(s,a)$ is called policy
improvement. Repeating policy evaluation and policy improvement, we may find the
optimal policy $\pi^{*}(a|s)$. This entire process is called policy iteration (Sutton & Barto,
1998).
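The two policy-improvement rules, Eqs. (7) and (8), can be sketched for a finite action set as follows. The table layout `q[s, a]` and the function names are assumptions made for illustration; with discrete actions the delta function in Eq. (8) becomes an indicator of the greedy action and $p_{\mathrm{u}}(a)$ becomes one over the number of actions.

```python
import numpy as np


def softmax_update(q, beta):
    """Eq. (7): pi'(a|s) proportional to exp(Q(s, a) / beta), one row per state."""
    z = (q - q.max(axis=1, keepdims=True)) / beta   # shift rows for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def epsilon_greedy_update(q, epsilon):
    """Eq. (8): mix the uniform distribution with the greedy action."""
    n_states, n_actions = q.shape
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    pi[np.arange(n_states), q.argmax(axis=1)] += 1.0 - epsilon
    return pi


q = np.array([[1.0, 2.0], [3.0, 0.0]])
pi_soft = softmax_update(q, beta=1.0)
pi_eps = epsilon_greedy_update(q, epsilon=0.1)
```

A large `beta` or `epsilon` gives a nearly uniform policy, while small values concentrate the probability on the greedy action.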

2.3 Least-squares Framework for Value Function Approximation

Although policy iteration is a useful framework for solving an MDP problem, it is
computationally expensive when the number of state-action pairs $|\mathcal{S}| \times |\mathcal{A}|$ is large. Furthermore,
when the state space or action space is continuous, $|\mathcal{S}|$ or $|\mathcal{A}|$ becomes infinite and therefore
it is no longer possible to directly implement policy iteration. To overcome this problem,
we approximate the state-action value function $Q^{\pi}(s,a)$ using the following linear model:

$$\hat{Q}^{\pi}(s,a;\theta) = \sum_{b=1}^{B} \theta_b \phi_b(s,a) = \theta^{\top} \phi(s,a), \tag{9}$$

where

$$\phi(s,a) = \big(\phi_1(s,a), \phi_2(s,a), \ldots, \phi_B(s,a)\big)^{\top} \tag{10}$$

are the fixed linearly independent basis functions, $\top$ denotes the transpose, $B$ is the
number of basis functions, and

$$\theta = (\theta_1, \theta_2, \ldots, \theta_B)^{\top} \tag{11}$$

are model parameters to be learned. Note that $B$ is usually chosen to be much smaller
than $|\mathcal{S}| \times |\mathcal{A}|$ for computational efficiency.

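For $N$-step transitions, we ideally want to learn the parameters $\theta$ so that the squared
Bellman residual $G(\theta)$ is minimized (Lagoudakis & Parr, 2003):

$$\theta^{*} = \mathop{\mathrm{argmin}}_{\theta} G(\theta), \tag{12}$$

$$G(\theta) = \mathop{\mathbb{E}}_{P_{\pi}} \left[ \frac{1}{N} \sum_{n=1}^{N} \big( \theta^{\top} \psi(s_n,a_n) - R(s_n,a_n) \big)^2 \right], \tag{13}$$

$$\psi(s,a) = \phi(s,a) - \gamma \mathop{\mathbb{E}}_{P_{\mathrm{T}}(s'|s,a)} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \phi(s',a') \right]. \tag{14}$$

$\mathbb{E}_{P_{\pi}}$ denotes the expectation over the joint probability density function of an entire
trajectory:

$$P_{\pi}(s_1,a_1,s_2,a_2,\ldots,s_N,a_N,s_{N+1}) = P_{\mathrm{I}}(s_1) \prod_{n=1}^{N} P_{\mathrm{T}}(s_{n+1}|s_n,a_n)\, \pi(a_n|s_n), \tag{15}$$

where $P_{\mathrm{I}}(s)$ denotes the initial-state probability density function.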

2.4  Value  Function  Approximation  from  Samples 

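Suppose that roll-out data samples consisting of $M$ episodes with $N$ steps are available
for training purposes. The agent initially starts from a randomly selected state $s_1$ following
the initial-state probability density $P_{\mathrm{I}}(s)$ and chooses an action based on the sampling
policy $\tilde{\pi}(a_n|s_n)$. Then the agent makes a transition following the transition probability
$P_{\mathrm{T}}(s_{n+1}|s_n,a_n)$ and receives a reward $r_n$ ($= R(s_n,a_n,s_{n+1})$). This is repeated for $N$
steps; thus the training dataset $\mathcal{D}^{\tilde{\pi}}$ is expressed as

$$\mathcal{D}^{\tilde{\pi}} \equiv \{ d_m^{\tilde{\pi}} \}_{m=1}^{M}, \tag{16}$$

where each episodic sample $d_m^{\tilde{\pi}}$ consists of a set of 4-tuple elements as

$$d_m^{\tilde{\pi}} \equiv \big\{ (s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}, r_{m,n}^{\tilde{\pi}}, s_{m,n+1}^{\tilde{\pi}}) \big\}_{n=1}^{N}. \tag{17}$$

We use two types of policies for different purposes: the sampling policy $\tilde{\pi}(a|s)$ for
collecting data samples and the evaluation policy $\pi(a|s)$ for computing the value function
$\hat{Q}^{\pi}$. Minimizing the importance-weighted empirical generalization error $\hat{G}(\theta)$, we can
obtain a consistent estimator of $\theta^{*}$ as follows:

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \hat{G}(\theta), \tag{18}$$

$$\hat{G}(\theta) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \big( \theta^{\top} \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}^{\tilde{\pi}}) - r_{m,n}^{\tilde{\pi}} \big)^2\, w_{m,N}^{\tilde{\pi}}, \tag{19}$$

$$\hat{\psi}(s,a;\mathcal{D}) = \phi(s,a) - \frac{\gamma}{|\mathcal{D}_{(s,a)}|} \sum_{s' \in \mathcal{D}_{(s,a)}} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \phi(s',a') \right], \tag{20}$$

where $\mathcal{D}_{(s,a)}$ is a set of 4-tuple elements¹ containing state $s$ and action $a$ in the training
data $\mathcal{D}$, $\sum_{s' \in \mathcal{D}_{(s,a)}}$ denotes the summation over $s'$ in the set $\mathcal{D}_{(s,a)}$, and

$$w_{m,N}^{\tilde{\pi}} = \frac{\prod_{n'=1}^{N} \pi(a_{m,n'}^{\tilde{\pi}} | s_{m,n'}^{\tilde{\pi}})}{\prod_{n'=1}^{N} \tilde{\pi}(a_{m,n'}^{\tilde{\pi}} | s_{m,n'}^{\tilde{\pi}})} \tag{21}$$

is called the importance weight (Sutton & Barto, 1998).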
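The sample-based feature in Eq. (20) can be sketched for a finite action set in the single-sample case, i.e., when only one observed transition contains the pair $(s,a)$, which is the approximation the footnote says was used in the experiments. The callables `phi(s, a)` and `pi_eval(a, s)` and the argument names are hypothetical; the expectation over $a'$ becomes a finite sum.

```python
import numpy as np


def psi_hat(phi, s, a, s_next, pi_eval, gamma, n_actions):
    """Single-sample version of Eq. (20).

    phi(s, a)      -> feature vector (NumPy array)
    pi_eval(a, s)  -> probability of action a in state s under the evaluation policy
    """
    expected_next = sum(
        pi_eval(a2, s_next) * phi(s_next, a2) for a2 in range(n_actions)
    )
    return phi(s, a) - gamma * expected_next


# Toy features and a uniform evaluation policy over two actions.
toy_phi = lambda s, a: np.array([float(s), float(a)])
uniform = lambda a, s: 0.5
v = psi_hat(toy_phi, s=1, a=0, s_next=2, pi_eval=uniform, gamma=0.9, n_actions=2)
```

Note that the expectation uses the evaluation policy, not the sampling policy that generated the transition; the mismatch between the two is corrected by the importance weight.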

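It is important to note that consistency of $\hat{\theta}$ can be maintained even if $w_{m,N}^{\tilde{\pi}}$ is
replaced by the per-decision importance weight $w_{m,n}^{\tilde{\pi}}$ (Precup, Sutton, & Singh, 2000), which
is computationally more efficient and stable. $\hat{\theta}$ can be analytically expressed with the
matrices $\hat{L}$ ($\in \mathbb{R}^{B \times MN}$), $\hat{X}$ ($\in \mathbb{R}^{MN \times B}$), $W$ ($\in \mathbb{R}^{MN \times MN}$), and the vector $r^{\tilde{\pi}}$ ($\in \mathbb{R}^{MN}$) as

$$\hat{\theta} = \hat{L} r^{\tilde{\pi}}, \tag{22}$$

$$\hat{L} = (\hat{X}^{\top} W \hat{X})^{-1} \hat{X}^{\top} W, \tag{23}$$

$$r^{\tilde{\pi}}_{N(m-1)+n} = r_{m,n}^{\tilde{\pi}}, \tag{24}$$

$$\hat{X}_{N(m-1)+n,\,b} = \hat{\psi}_b(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}^{\tilde{\pi}}), \tag{25}$$

$$W_{N(m-1)+n,\,N(m'-1)+n'} = w_{m,n}^{\tilde{\pi}}\, I(m = m')\, I(n = n'), \tag{26}$$

where $I(c)$ denotes the indicator function:

$$I(c) = \begin{cases} 1 & \text{if the condition } c \text{ is true}, \\ 0 & \text{otherwise}. \end{cases} \tag{27}$$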
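The two kinds of importance weight can be computed from the action probabilities along each episode. In the sketch below, the per-decision weight at step $n$ is taken to be the product of probability ratios up to step $n$, following the usual definition in Precup et al. (2000); the array names are hypothetical, and each entry holds the probability that the respective policy assigns to the action actually taken.

```python
import numpy as np


def importance_weights(p_eval, p_samp):
    """p_eval[m, n], p_samp[m, n]: probabilities of the taken actions under the
    evaluation policy and the sampling policy, respectively.

    Returns (w_traj, w_step):
      w_traj[m]    trajectory-wise weight, Eq. (21)
      w_step[m, n] per-decision weight (product of ratios up to step n)
    """
    ratios = p_eval / p_samp
    w_step = np.cumprod(ratios, axis=1)
    w_traj = w_step[:, -1]
    return w_traj, w_step


p_eval = np.array([[0.9, 0.1]])
p_samp = np.array([[0.5, 0.5]])
w_traj, w_step = importance_weights(p_eval, p_samp)
```

Because the per-decision weight at an early step multiplies fewer ratios, it typically fluctuates less than the full-trajectory product, which is one way to read the remark that it is more stable.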


When the matrix $\hat{X}^{\top} W \hat{X}$ is ill-conditioned, it is hard to compute its inverse accurately.
To cope with this problem, we may practically employ a regularization scheme (Tikhonov
& Arsenin, 1977; Hoerl & Kennard, 1970; Poggio & Girosi, 1990):

$$(\hat{X}^{\top} W \hat{X} + \lambda I)^{-1}, \tag{28}$$

where $I$ ($\in \mathbb{R}^{B \times B}$) is the identity matrix and $\lambda$ is a small positive scalar.
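Equations (22), (23), and (28) together amount to a regularized, importance-weighted least-squares fit. A minimal NumPy sketch follows, assuming the rows of `X` already hold the feature vectors $\hat{\psi}$ and `w` holds one importance weight per row; the function and variable names are illustrative.

```python
import numpy as np


def weighted_ls_fit(X, w, r, lam=1e-3):
    """theta = (X^T W X + lam * I)^{-1} X^T W r with W = diag(w)."""
    XtW = X.T * w                       # equals X.T @ np.diag(w) without forming W
    A = XtW @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, XtW @ r)  # solve instead of explicit inversion


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta_true = np.array([1.0, -2.0, 0.5])
w = rng.uniform(0.5, 2.0, size=20)
theta_hat = weighted_ls_fit(X, w, X @ theta_true, lam=0.0)
```

Using `np.linalg.solve` rather than forming the inverse is a design choice made here for numerical stability; the estimator itself is the one written in the text.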

¹ When the state-action space is continuous, the set $\mathcal{D}_{(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}})}$ contains only a single sample
$(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}, r_{m,n}^{\tilde{\pi}}, s_{m,n+1}^{\tilde{\pi}})$ and then consistency of $\hat{\theta}$ may not be guaranteed. A possible remedy for
this would be to use several neighbor samples around $(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}})$. However, in our experiments, we
decided to use the single-sample approximation since it performed reasonably well.

3 Efficient Exploration with Active Learning

The accuracy of the estimated value function depends on the training samples collected
following the sampling policy $\tilde{\pi}(a|s)$. In this section, we propose a new method for designing
a good sampling policy based on a statistical AL method proposed in Sugiyama (2006).


3.1  Preliminaries 

Let  us  consider  a  situation  where  collecting  state-action  trajectory  samples  is  easy  and 
cheap,  but  gathering  immediate  reward  samples  is  hard  and  expensive  (for  example,  the 
batting  robot  explained  in  the  introduction).  In  such  a  case,  immediate  reward  samples 
are  too  expensive  to  be  used  for  designing  the  sampling  policy;  only  state-action  trajectory 
samples  may  be  used  for  sampling  policy  design. 

The goal of AL in the current setup is to determine the sampling policy so that
the expected generalization error is minimized. The generalization error is not accessible
in practice since the expected reward function $R(s,a)$ and the transition probability
$P_{\mathrm{T}}(s'|s,a)$ are unknown. Thus, for performing AL, the generalization error needs to be
estimated from samples. A difficulty of estimating the generalization error in the context
of AL is that its estimation needs to be carried out only from state-action trajectory
samples without using immediate reward samples. This means that standard generalization
error estimation techniques such as cross-validation (Hachiya, Akiyama, Sugiyama,
& Peters, 2009) cannot be employed since they require both state-action and immediate
reward samples. Below, we explain how the generalization error could be estimated under
the AL setup (i.e., without the reward samples).


3.2  Decomposition  of  Generalization  Error 

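The information we are allowed to use for estimating the generalization error is a set of
roll-out samples without immediate rewards:

$$\mathcal{D}^{\tilde{\pi}} \equiv \{ d_m^{\tilde{\pi}} \}_{m=1}^{M}, \tag{29}$$

$$d_m^{\tilde{\pi}} \equiv \big\{ (s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}, s_{m,n+1}^{\tilde{\pi}}) \big\}_{n=1}^{N}. \tag{30}$$

Let us define the deviation of immediate rewards from the mean as

$$\epsilon_{m,n}^{\tilde{\pi}} \equiv r_{m,n}^{\tilde{\pi}} - R(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}). \tag{31}$$

Note that $\epsilon_{m,n}^{\tilde{\pi}}$ could be regarded as additive noise in the context of least-squares function
fitting. By definition, $\epsilon_{m,n}^{\tilde{\pi}}$ has mean zero and its variance generally depends on $s_{m,n}^{\tilde{\pi}}$ and
$a_{m,n}^{\tilde{\pi}}$, i.e., heteroscedastic noise (Bishop, 2006). However, since estimating the variance of
$\epsilon_{m,n}^{\tilde{\pi}}$ without using reward samples is not generally possible, we ignore the dependence of
the variance on $s_{m,n}^{\tilde{\pi}}$ and $a_{m,n}^{\tilde{\pi}}$. Let us denote the input-independent common variance by
$\sigma^2$.

Now we would like to estimate the generalization error

$$G(\hat{\theta}) = \mathop{\mathbb{E}}_{P_{\pi}} \left[ \frac{1}{N} \sum_{n=1}^{N} \big( \hat{\theta}^{\top} \hat{\psi}(s_n,a_n;\mathcal{D}^{\tilde{\pi}}) - R(s_n,a_n) \big)^2 \right] \tag{32}$$

from $\mathcal{D}^{\tilde{\pi}}$. Its expectation over 'noise' can be decomposed as follows (Sugiyama, 2006):

$$\mathop{\mathbb{E}}_{\epsilon^{\tilde{\pi}}} \left[ G(\hat{\theta}) \right] = \mathrm{Bias} + \mathrm{Variance} + \mathrm{ModelError}, \tag{33}$$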


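where $\mathbb{E}_{\epsilon^{\tilde{\pi}}}$ denotes the expectation over 'noise' $\{\epsilon_{m,n}^{\tilde{\pi}}\}_{m=1,n=1}^{M,N}$. Bias, Variance, and ModelError
are the bias term, the variance term, and the model error term defined by

$$\mathrm{Bias} \equiv \mathop{\mathbb{E}}_{P_{\pi}} \left[ \frac{1}{N} \sum_{n=1}^{N} \Big\{ \big( \mathop{\mathbb{E}}_{\epsilon^{\tilde{\pi}}}[\hat{\theta}] - \theta^{*} \big)^{\top} \hat{\psi}(s_n,a_n;\mathcal{D}^{\tilde{\pi}}) \Big\}^2 \right], \tag{34}$$

$$\mathrm{Variance} \equiv \mathop{\mathbb{E}}_{P_{\pi}} \mathop{\mathbb{E}}_{\epsilon^{\tilde{\pi}}} \left[ \frac{1}{N} \sum_{n=1}^{N} \Big\{ \big( \hat{\theta} - \mathop{\mathbb{E}}_{\epsilon^{\tilde{\pi}}}[\hat{\theta}] \big)^{\top} \hat{\psi}(s_n,a_n;\mathcal{D}^{\tilde{\pi}}) \Big\}^2 \right], \tag{35}$$

$$\mathrm{ModelError} \equiv \mathop{\mathbb{E}}_{P_{\pi}} \left[ \frac{1}{N} \sum_{n=1}^{N} \big( \theta^{*\top} \hat{\psi}(s_n,a_n;\mathcal{D}^{\tilde{\pi}}) - R(s_n,a_n) \big)^2 \right]. \tag{36}$$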

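$\theta^{*}$ is the optimal parameter in the model, defined by Eq.(12). Note that the variance
term can be expressed in a compact form as

$$\mathrm{Variance} = \sigma^2\, \mathrm{tr}(U \hat{L} \hat{L}^{\top}), \tag{37}$$

where the matrix $U$ ($\in \mathbb{R}^{B \times B}$) is defined as

$$U_{b,b'} \equiv \mathop{\mathbb{E}}_{P_{\pi}} \left[ \frac{1}{N} \sum_{n=1}^{N} \hat{\psi}_b(s_n,a_n;\mathcal{D}^{\tilde{\pi}})\, \hat{\psi}_{b'}(s_n,a_n;\mathcal{D}^{\tilde{\pi}}) \right]. \tag{38}$$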
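The compact form in Eq. (37) follows in a few lines. The derivation below is a sketch under the assumptions already made in the text: the noise terms have mean zero and common variance $\sigma^2$, they are taken to be uncorrelated, and $\hat{L}$ and $\hat{\psi}$ depend only on the state-action samples. The symbols $z$ (stacked mean rewards) and $\epsilon$ (stacked noise vector) are shorthand introduced here.

```latex
% Split the reward vector into its mean and the noise: r = z + \epsilon.
\hat{\theta} = \hat{L} r^{\tilde{\pi}} = \hat{L}(z + \epsilon)
\quad\Longrightarrow\quad
\hat{\theta} - \mathbb{E}_{\epsilon}[\hat{\theta}] = \hat{L}\epsilon .

% For a fixed feature vector \hat{\psi}, use E[\epsilon \epsilon^T] = \sigma^2 I:
\mathbb{E}_{\epsilon}\!\left[ \big( (\hat{L}\epsilon)^{\top} \hat{\psi} \big)^2 \right]
= \hat{\psi}^{\top} \hat{L}\, \mathbb{E}_{\epsilon}[\epsilon \epsilon^{\top}]\, \hat{L}^{\top} \hat{\psi}
= \sigma^2\, \mathrm{tr}\!\left( \hat{\psi}\hat{\psi}^{\top} \hat{L}\hat{L}^{\top} \right).

% Averaging over n and taking the expectation over P_\pi turns
% \hat{\psi}\hat{\psi}^T into U (Eq. (38)):
\mathrm{Variance} = \sigma^2\, \mathrm{tr}\!\left( U \hat{L}\hat{L}^{\top} \right).
```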


3.3  Estimation  of  Generalization  Error  for  AL 


The model error is constant and thus can be safely ignored in generalization error
estimation since we are interested in finding a minimizer of the generalization error with
respect to $\tilde{\pi}$. So we focus on the bias term and the variance term. However, the bias term
includes the unknown optimal parameter $\theta^{*}$, and thus it may not be possible to estimate
the bias term without using reward samples; similarly, it may not be possible to estimate
the 'noise' variance $\sigma^2$ included in the variance term without using reward samples.

It is known that the bias term is small enough to be neglected when the model is
approximately correct (Sugiyama, 2006), i.e., $\theta^{*\top} \hat{\psi}(s,a)$ approximately agrees with the
true function $R(s,a)$. Then we have

$$\mathop{\mathbb{E}}_{\epsilon^{\tilde{\pi}}} \left[ G(\hat{\theta}) \right] - \mathrm{ModelError} - \mathrm{Bias} \propto \mathrm{tr}(U \hat{L} \hat{L}^{\top}), \tag{39}$$

which does not require immediate reward samples for its computation. Since $\mathbb{E}_{P_{\pi}}$ included
in $U$ is not accessible (see Eq.(38)), we replace $U$ by its consistent estimator $\hat{U}$:

$$\hat{U} = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}^{\tilde{\pi}})\, \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}^{\tilde{\pi}})^{\top}\, w_{m,n}^{\tilde{\pi}}. \tag{40}$$

Consequently, we have the following generalization error estimator:

$$J \equiv \mathrm{tr}(\hat{U} \hat{L} \hat{L}^{\top}), \tag{41}$$

which can be computed only from $\mathcal{D}^{\tilde{\pi}}$ and thus can be employed in the AL scenarios. If
it is possible to gather $\mathcal{D}^{\tilde{\pi}}$ multiple times, the above $J$ may be computed multiple times
and its average $J'$ may be used as a generalization error estimator.
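The estimator $J$ uses only features and importance weights, never rewards. A minimal NumPy sketch follows, assuming the rows of `X` hold $\hat{\psi}$ for the $MN$ sampled state-action pairs and `w` holds the matching importance weights; the optional `lam` corresponds to the regularizer of Eq. (28), and the names are illustrative.

```python
import numpy as np


def generalization_error_estimate(X, w, lam=0.0):
    """J = tr(U_hat L L^T), Eq. (41); no reward samples are needed."""
    n_samples, n_basis = X.shape
    XtW = X.T * w                                               # X^T W, W = diag(w)
    L = np.linalg.solve(XtW @ X + lam * np.eye(n_basis), XtW)   # B x MN matrix
    U_hat = XtW @ X / n_samples                                 # Eq. (40)
    return float(np.trace(U_hat @ L @ L.T))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
J_uniform = generalization_error_estimate(X, np.ones(20))
```

With all weights equal to one (sampling policy equal to the evaluation policy) and no regularization, $\hat{L}\hat{L}^{\top} = (X^{\top}X)^{-1}$ and $J$ reduces to $B/(MN)$, here $3/20$; differences between candidate sampling policies therefore show up through where the samples fall and through the importance weights.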

Note that the values of the generalization error estimator $J$ and the true generalization
error $G$ are not directly comparable since irrelevant additive and multiplicative constants
are ignored (see Eq.(39)). We expect that the estimator $J$ has a similar profile to the true
error $G$ as a function of sampling policy $\tilde{\pi}$ since the purpose of deriving a generalization
error estimator in AL is not to approximate the true generalization error itself, but to
approximate the minimizer of the true generalization error with respect to sampling policy
$\tilde{\pi}$. We will experimentally investigate this issue in Section 3.5.

3.4  Designing  Sampling  Policies 

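Based on the generalization error estimator derived above, we give an algorithm for
designing a good sampling policy, which fully makes use of the roll-out samples without
immediate rewards.

1. Prepare $K$ candidates of sampling policy: $\{\tilde{\pi}_k\}_{k=1}^{K}$.

2. Collect episodic samples without immediate rewards for each sampling-policy
candidate: $\{\mathcal{D}^{\tilde{\pi}_k}\}_{k=1}^{K}$.

3. Estimate $U$ using all samples $\{\mathcal{D}^{\tilde{\pi}_k}\}_{k=1}^{K}$:

$$\hat{U} = \frac{1}{KMN} \sum_{k=1}^{K} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}_k}, a_{m,n}^{\tilde{\pi}_k}; \{\mathcal{D}^{\tilde{\pi}_k}\}_{k=1}^{K})\, \hat{\psi}(s_{m,n}^{\tilde{\pi}_k}, a_{m,n}^{\tilde{\pi}_k}; \{\mathcal{D}^{\tilde{\pi}_k}\}_{k=1}^{K})^{\top}\, w_{m,n}^{\tilde{\pi}_k}. \tag{42}$$

4. Estimate the generalization error for each $k$:

$$J_k \equiv \mathrm{tr}(\hat{U} \hat{L}^{\tilde{\pi}_k} \hat{L}^{\tilde{\pi}_k \top}), \tag{43}$$

$$\hat{L}^{\tilde{\pi}_k} = (\hat{X}^{\tilde{\pi}_k \top} W^{\tilde{\pi}_k} \hat{X}^{\tilde{\pi}_k})^{-1} \hat{X}^{\tilde{\pi}_k \top} W^{\tilde{\pi}_k}, \tag{44}$$

$$\hat{X}^{\tilde{\pi}_k}_{N(m-1)+n,\,b} = \hat{\psi}_b(s_{m,n}^{\tilde{\pi}_k}, a_{m,n}^{\tilde{\pi}_k}; \{\mathcal{D}^{\tilde{\pi}_k}\}_{k=1}^{K}), \tag{45}$$

$$W^{\tilde{\pi}_k}_{N(m-1)+n,\,N(m'-1)+n'} = w_{m,n}^{\tilde{\pi}_k}\, I(m = m')\, I(n = n'). \tag{46}$$

5. (If possible) repeat 2. to 4. several times and calculate the average for each $k$:
$\{J'_k\}_{k=1}^{K}$.

6. Determine the sampling policy: $\tilde{\pi}_{\mathrm{AL}} = \mathop{\mathrm{argmin}}_{k} J'_k$.

7. Collect training samples with immediate rewards following $\tilde{\pi}_{\mathrm{AL}}$: $\mathcal{D}^{\tilde{\pi}_{\mathrm{AL}}}$.

8. Learn the value function by LSPI using $\mathcal{D}^{\tilde{\pi}_{\mathrm{AL}}}$.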
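Steps 3, 4, and 6 above can be sketched as follows. The sketch assumes that, for each candidate $k$, a feature matrix `X_list[k]` (rows $\hat{\psi}$) and a weight vector `w_list[k]` have already been computed from reward-free roll-outs; it performs a single pass, so the averaging of step 5 is left to the caller. All names are illustrative.

```python
import numpy as np


def select_sampling_policy(X_list, w_list, lam=1e-3):
    """Return (index of the candidate with the smallest J_k, list of J_k)."""
    n_basis = X_list[0].shape[1]
    n_total = sum(X.shape[0] for X in X_list)
    # Step 3: pool the samples of every candidate to estimate U, Eq. (42)
    U_hat = sum((X.T * w) @ X for X, w in zip(X_list, w_list)) / n_total
    scores = []
    for X, w in zip(X_list, w_list):
        # Step 4: J_k = tr(U_hat L_k L_k^T), Eqs. (43)-(46)
        XtW = X.T * w
        L_k = np.linalg.solve(XtW @ X + lam * np.eye(n_basis), XtW)
        scores.append(float(np.trace(U_hat @ L_k @ L_k.T)))
    # Step 6: pick the candidate with the smallest estimate
    return int(np.argmin(scores)), scores


rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 3))
X2 = np.vstack([X1, X1])          # same design, twice as many samples
best, scores = select_sampling_policy(
    [X1, X2], [np.ones(20), np.ones(40)], lam=0.0
)
```

In this toy call the second candidate simply has twice the data, so its estimated error is half that of the first and it is selected; in the AL setting the candidates differ in which state-action pairs they visit rather than in sample count.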


3.5  Numerical  Examples 

Here we illustrate how the above AL method behaves in the 10-state chain-walk
environment shown in Figure 1. The MDP consists of 10 states

$$\mathcal{S} = \{s^{(i)}\}_{i=1}^{10} = \{1, 2, \ldots, 10\} \tag{47}$$

and 2 actions

$$\mathcal{A} = \{a^{(i)}\}_{i=1}^{2} = \{\text{'L'}, \text{'R'}\}. \tag{48}$$

The immediate reward function is defined as

$$R(s,a,s') = f(s'), \tag{49}$$

where the profile of the function $f(s')$ is illustrated in Figure 2.

The transition probability $P_{\mathrm{T}}(s'|s,a)$ is indicated by the numbers attached to the
arrows in Figure 1; for example, $P_{\mathrm{T}}(s^{(2)}|s^{(1)}, \text{'R'}) = 0.8$ and $P_{\mathrm{T}}(s^{(1)}|s^{(1)}, \text{'R'}) = 0.2$. Thus
the agent can successfully move to the intended direction with probability 0.8 (indicated
by solid arrows in the figure) and the action fails with probability 0.2 (indicated by
dashed arrows in the figure). The discount factor $\gamma$ is set to 0.9. We use the 12
basis functions $\phi(s,a)$ defined as

$$\phi_{2(i-1)+j}(s,a) = \begin{cases} I(a = a^{(j)}) \exp\!\left( -\dfrac{(s - c_i)^2}{2\tau^2} \right) & \text{for } i = 1, 2, \ldots, 5 \text{ and } j = 1, 2, \\[2mm] I(a = a^{(j)}) & \text{for } i = 6 \text{ and } j = 1, 2, \end{cases} \tag{50}$$

where $c_1 = 1$, $c_2 = 3$, $c_3 = 5$, $c_4 = 7$, $c_5 = 9$, and $\tau = 1.5$.
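The chain-walk model and its basis functions can be sketched in NumPy as follows. Two points are assumptions rather than statements from the text: a failed action is taken to move the agent in the opposite direction, with moves past either end leaving the state unchanged (this matches the example probabilities quoted for state 1), and the kernels are taken to be Gaussians with centers $c_i$ and width $\tau$. States are indexed 0 to 9 in the transition array but passed as 1 to 10 to the feature function.

```python
import numpy as np

CENTERS = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
TAU = 1.5


def chain_walk_transitions(n_states=10, p_success=0.8):
    """P[s, a, s2] with a = 0 for 'L' and a = 1 for 'R' (0-based states)."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        left, right = max(s - 1, 0), min(s + 1, n_states - 1)
        P[s, 0, left] += p_success          # 'L' succeeds
        P[s, 0, right] += 1.0 - p_success   # 'L' fails (assumed: opposite move)
        P[s, 1, right] += p_success         # 'R' succeeds
        P[s, 1, left] += 1.0 - p_success    # 'R' fails (assumed: opposite move)
    return P


def phi(s, a):
    """12 features for state s in 1..10 and action a (0 = 'L', 1 = 'R')."""
    kernels = np.exp(-((s - CENTERS) ** 2) / (2.0 * TAU ** 2))
    per_action = np.append(kernels, 1.0)    # five kernels plus a constant
    features = np.zeros(12)
    features[a::2] = per_action             # 1-based index 2(i-1)+j with j = a + 1
    return features


P = chain_walk_transitions()
f_left = phi(1, 0)
f_right = phi(5, 1)
```

Each action thus owns six of the twelve features, and the features of the action not taken are zero.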

For illustration purposes, we evaluate the selection of sampling policies only in one-step
policy evaluation; evaluation through iterations will be addressed in the next section.
Sampling policies and evaluation policies are constructed as follows. First, we prepare a
deterministic 'base' policy $\bar{\pi}$, e.g., 'LLLLLRRRRR', where the $i$-th letter denotes the
action taken at $s^{(i)}$. Let $\bar{\pi}^{\epsilon}$ be the 'ε-greedy' version of the base policy $\bar{\pi}$, i.e., the intended
action is chosen with probability $1 - \epsilon/2$ and the other action is chosen
with probability $\epsilon/2$. We perform experiments for three different evaluation policies:

$$\bar{\pi}_1^{\epsilon}: \text{'RRRRRRRRRR'}, \tag{51}$$

$$\bar{\pi}_2^{\epsilon}: \text{'RRLLLLLRRR'}, \tag{52}$$

$$\bar{\pi}_3^{\epsilon}: \text{'LLLLLRRRRR'}, \tag{53}$$

with $\epsilon = 0.1$. For each evaluation policy $\bar{\pi}_i^{0.1}$ ($i = 1, 2, 3$), we prepare 10 candidates of the
sampling policy $\{\tilde{\pi}_i^{(k)}\}_{k=1}^{10}$, where the $k$-th sampling policy $\tilde{\pi}_i^{(k)}$ is defined as $\bar{\pi}_i^{k/10}$. Note
that $\tilde{\pi}_i^{(1)}$ is equivalent to the evaluation policy $\bar{\pi}_i^{0.1}$.

Figure 1: 10-state chain walk. Filled/unfilled arrows indicate the transitions when taking
action 'R'/'L' and solid/dashed lines indicate the successful/failed transitions.

Figure 2: Profile of the function $f(s')$.
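The 'ε-greedy version' of a base policy used in this experiment differs slightly from Eq. (8) only in notation: with two actions, the intended action gets probability $1 - \epsilon/2$ and the other gets $\epsilon/2$. A minimal sketch, in which the function name `soften` and the array layout are hypothetical:

```python
import numpy as np


def soften(base, eps):
    """Return pi[s, a] for the eps-greedy version of a base policy string.

    base : string over {'L', 'R'}, one letter per state
    Columns of the result: 0 = 'L', 1 = 'R'.
    """
    pi = np.full((len(base), 2), eps / 2.0)
    for s, letter in enumerate(base):
        pi[s, "LR".index(letter)] = 1.0 - eps / 2.0
    return pi


pi_eval = soften("LLLLLRRRRR", 0.1)
```

Setting `eps` to 1 gives the uniform random policy, and setting it to 0 recovers the deterministic base policy.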

For each sampling policy, we calculate the $J$-value 5 times and take the average.
The numbers of episodes and steps are set to $M = 10$ and $N = 10$, respectively. The
initial-state probability $P_{\mathrm{I}}(s)$ is set to be uniform. The regularization parameter is set
to $\lambda = 10^{-3}$ for avoiding matrix singularity. This experiment is repeated 100 times with
different random seeds and the mean and standard deviation of the true generalization
error and its estimate are evaluated.

The  results  are  depicted  in  Figure  3  (the  true  generalization  error)  and  Figure  4  (its 
estimate)  as  functions  of  the  index  k  of  the  sampling  policies.  Note  that  in  these  figures, 
we  ignored  irrelevant  additive  and  multiplicative  constants  when  deriving  the  general¬ 
ization  error  estimator  (see  Eq.(39)).  Thus,  directly  comparing  the  values  of  the  true 
generalization  error  and  its  estimate  is  meaningless.  The  graphs  show  that  the  proposed 
generalization  error  estimator  overall  captures  the  trend  of  the  true  generalization  error 
well  for  all  three  cases. 

For active learning purposes, we are interested in choosing the value of $k$ so that the
true generalization error is minimized. Next, we investigate the values of the obtained
generalization error $G$ when $k$ is chosen so that $J$ is minimized (active learning; AL), the
evaluation policy ($k = 1$) is used for sampling (passive learning; PL), and $k$ is chosen
optimally so that the true generalization error is minimized (optimal; OPT). Figure 5
depicts the box-plots of the generalization error values for AL, PL, and OPT over 100
trials, where the box-plot notation indicates the 5%-quantile, 25%-quantile, 50%-quantile
(i.e., median), 75%-quantile, and 95%-quantile from bottom to top. The graphs show that
the proposed AL method compares favorably with PL and performs well for reducing the
generalization error.

Figure 3: The mean and standard deviation of the true generalization error over 100
trials.

Figure 4: The mean and standard deviation of the estimated generalization error $J$ over
100 trials.

Figure 5: The box-plots of the values of the obtained generalization error $G$ over 100
trials when $k$ is chosen so that $J$ is minimized (active learning; AL), the evaluation
policy ($k = 1$) is used for sampling (passive learning; PL), and $k$ is chosen optimally so
that the true generalization error is minimized (optimal; OPT). The box-plot notation
indicates the 5%-quantile, 25%-quantile, 50%-quantile (i.e., median), 75%-quantile, and
95%-quantile from bottom to top.

We  will  continue  the  performance  evaluation  of  the  proposed  AL  method  through 
iterations  in  Section  4.2. 


4  Active  Learning  in  Policy  Iteration 

In  Section  3,  we  have  shown  that  the  unknown  generalization  error  could  be  accurately 
estimated  without  using  immediate  reward  samples  in  one-step  policy  evaluation.  In  this 
section,  we  extend  the  idea  to  the  full  policy-iteration  setup. 


4.1  Sample  Reuse  Policy  Iteration  with  Active  Learning 

Sample reuse policy iteration (SRPI) (Hachiya et al., 2009) is a recently proposed
framework of off-policy RL (Sutton & Barto, 1998; Precup et al., 2000), which allows us to
reuse previously-collected samples effectively. Let us denote the evaluation policy at the
$l$-th iteration by $\pi_l$ and the maximum number of iterations by $L$.

In the policy iteration framework, new data samples $\mathcal{D}^{\pi_l}$ are collected following the
new policy $\pi_l$ for the next policy evaluation step. In ordinary policy-iteration methods,
only the new samples $\mathcal{D}^{\pi_l}$ are used for policy evaluation. Thus the previously-collected
data samples $\{\mathcal{D}^{\pi_1}, \mathcal{D}^{\pi_2}, \ldots, \mathcal{D}^{\pi_{l-1}}\}$ are not utilized:

$$\pi_1 \xrightarrow{E:\{\mathcal{D}^{\pi_1}\}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{\mathcal{D}^{\pi_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{\mathcal{D}^{\pi_3}\}} \cdots \xrightarrow{I} \pi_{L+1}, \tag{54}$$

where 'E : {$\mathcal{D}$}' indicates policy evaluation using the data sample $\mathcal{D}$ and 'I' denotes
policy improvement. On the other hand, in SRPI, all previously-collected data samples
are reused for policy evaluation as

$$\pi_1 \xrightarrow{E:\{\mathcal{D}^{\pi_1}\}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{\mathcal{D}^{\pi_1},\mathcal{D}^{\pi_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{\mathcal{D}^{\pi_1},\mathcal{D}^{\pi_2},\mathcal{D}^{\pi_3}\}} \cdots \xrightarrow{I} \pi_{L+1}, \tag{55}$$

where appropriate importance weights are applied to each set of previously-collected
samples in the policy evaluation step.

Here, we apply the AL technique proposed in the previous section to the SRPI
framework. More specifically, we optimize the sampling policy at each iteration. Then the
iteration process becomes

$$\pi_1 \xrightarrow{E:\{\mathcal{D}^{\tilde{\pi}_1}\}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{\mathcal{D}^{\tilde{\pi}_1},\mathcal{D}^{\tilde{\pi}_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{\mathcal{D}^{\tilde{\pi}_1},\mathcal{D}^{\tilde{\pi}_2},\mathcal{D}^{\tilde{\pi}_3}\}} \cdots \xrightarrow{I} \pi_{L+1}. \tag{56}$$

Thus, we do not gather samples following the current evaluation policy $\pi_l$, but following
the sampling policy $\tilde{\pi}_l$ optimized based on the AL method given in the previous section.
We call this framework active policy iteration (API). Figure 6 and Figure 7 show the
pseudo code of the API algorithm. Note that the previously-collected samples are used
not only for value function approximation, but also for sampling-policy selection. Thus
API fully makes use of the samples.
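The control flow of API can be sketched as a short loop in which every problem-specific piece is passed in as a function. The four callables below (`select_policy`, `collect`, `evaluate`, `improve`) are hypothetical placeholders, not part of the original pseudo code; the point of the sketch is only that datasets accumulate across iterations and are passed both to sampling-policy selection and to policy evaluation.

```python
def active_policy_iteration(pi_1, n_iterations, select_policy, collect, evaluate, improve):
    """Skeleton of the API loop with sample reuse.

    select_policy(datasets, pi) -> sampling policy chosen by active learning
    collect(pi_samp)            -> new episodic samples with rewards
    evaluate(datasets, pi)      -> parameters of the fitted value function
    improve(theta)              -> next evaluation policy
    """
    pi, datasets = pi_1, []
    for _ in range(n_iterations):
        pi_samp = select_policy(datasets, pi)   # uses previously collected data
        datasets.append(collect(pi_samp))       # old samples are kept, not discarded
        theta = evaluate(datasets, pi)          # importance-weighted fit on all data
        pi = improve(theta)
    return pi


# Toy stubs that only record how much data each evaluation step sees.
seen = []
final = active_policy_iteration(
    "pi1",
    3,
    select_policy=lambda data, pi: pi + "~",
    collect=lambda pi_samp: pi_samp,
    evaluate=lambda data, pi: seen.append(len(data)) or len(data),
    improve=lambda theta: "pi" + str(theta + 1),
)
```

Replacing `select_policy` by a function that returns the current policy unchanged turns this loop into SRPI with passive sampling.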

4.2  Numerical  Examples 

Here we illustrate how the API method behaves using the same 10-state chain-walk
problem as Section 3.5 (see Figure 1).

The initial evaluation policy $\pi_1$ is set as

$$\pi_1(a|s) = 0.15\, p_{\mathrm{u}}(a) + 0.85\, I\big(a = \mathop{\mathrm{argmax}}_{a'} \hat{Q}_0(s,a')\big), \tag{57}$$

where

$$\hat{Q}_0(s,a) = \sum_{b=1}^{12} \phi_b(s,a). \tag{58}$$

Policies are updated in the $l$-th iteration using the ε-greedy rule with $\epsilon = 0.15/l$. In the
sampling-policy selection step of the $l$-th iteration, we prepare the four sampling-policy
candidates

$$\{\tilde{\pi}_l^{(1)}, \tilde{\pi}_l^{(2)}, \tilde{\pi}_l^{(3)}, \tilde{\pi}_l^{(4)}\} = \{\bar{\pi}_l^{0.15/l},\ \bar{\pi}_l^{0.15/l+0.15},\ \bar{\pi}_l^{0.15/l+0.5},\ \bar{\pi}_l^{0.15/l+0.85}\}, \tag{59}$$

where $\bar{\pi}_l$ denotes the policy obtained by greedy update using $\hat{Q}^{\pi_{l-1}}$. The number of
iterations to learn the policy is set to $L = 7$, the number of steps is set to $N = 10$, and
the number $M$ of episodes is different in each iteration and defined as

$$\{M_1, M_2, M_3, M_4, M_5, M_6, M_7\}, \tag{60}$$

where $M_l$ ($l = 1, 2, \ldots, 7$) denotes the number of episodes collected in the $l$-th iteration.
In this experiment, we compare two types of scheduling: $\{5, 5, 3, 3, 3, 1, 1\}$ and
$\{3, 3, 3, 3, 3, 3, 3\}$, which we refer to as the 'decreasing M' strategy and the 'fixed M'
strategy, respectively. The $J$-value calculation is repeated 5 times for AL. In order to
avoid matrix singularity, the regularization parameter is set to $\lambda = 10^{-3}$. The performance
of the learned policy $\pi_{L+1}$ is measured by the discounted sum of immediate rewards
for test samples $\{r_{m,n}^{\pi_{L+1}}\}_{m,n=1}^{50}$ (50 episodes with 50 steps collected following $\pi_{L+1}$):

$$\mathrm{Performance} = \frac{1}{50} \sum_{m=1}^{50} \sum_{n=1}^{50} \gamma^{n-1} r_{m,n}^{\pi_{L+1}}, \tag{61}$$

where the discount factor $\gamma$ is set to 0.9.
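The performance measure in Eq. (61) is the discounted return averaged over test episodes. A minimal sketch, assuming the test rewards are stored as an array with one row per episode and one column per step (the function name is illustrative):

```python
import numpy as np


def performance(rewards, gamma=0.9):
    """Average discounted return over episodes, Eq. (61).

    rewards[m, n] is the reward at step n + 1 of test episode m.
    """
    n_steps = rewards.shape[1]
    discounts = gamma ** np.arange(n_steps)      # gamma^(n-1) for n = 1..N
    return float((rewards * discounts).sum(axis=1).mean())


score_small = performance(np.array([[1.0, 1.0]]), gamma=0.5)
score_ones = performance(np.ones((50, 50)), gamma=0.9)
```

For a constant reward of one over 50 steps, the value equals the geometric sum $(1 - 0.9^{50})/(1 - 0.9)$, slightly below 10.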

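We compare the performance of passive learning (PL; the current policy is used as the
sampling policy in each iteration) and the proposed AL method (the best sampling policy
is chosen from the policy candidates prepared in each iteration). We repeat the same

Algorithm 1: ActivePolicyIteration($\phi$, $\pi_1$, $\lambda$, $Z$)

// $\phi$: Basis functions, $\phi(s,a) = (\phi_1(s,a), \phi_2(s,a), \ldots, \phi_B(s,a))^{\top}$
// $\pi_1$: Initial policy, $\pi_1(a|s) \in [0,1]$
// $\lambda$: Regularization parameter, $\lambda > 0$
// $Z$: The number of $J$-value calculations to take the average $J'$, $Z \in \mathbb{N}$

$l \leftarrow 1$
for $l \leftarrow 1, 2, \ldots, L$ do
    // Determine sampling policy $\tilde{\pi}_l$ by the active learning method
    $\tilde{\pi}_l \leftarrow$ SamplingPolicySelection($\{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l-1}$, $\phi$, $\pi_l$, $\lambda$, $Z$)
    // Collect episodic samples using policy $\tilde{\pi}_l$
    $\mathcal{D}^{\tilde{\pi}_l} \leftarrow$ DataSampling($\tilde{\pi}_l$)
    // Learn the value function $\hat{Q}^{\pi_l}$ from the samples $\{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l}$
    $\hat{A} \leftarrow \frac{1}{lMN} \sum_{l'=1}^{l} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}_{l'}}, a_{m,n}^{\tilde{\pi}_{l'}}; \{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l})\, \hat{\psi}(s_{m,n}^{\tilde{\pi}_{l'}}, a_{m,n}^{\tilde{\pi}_{l'}}; \{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l})^{\top}\, w_{m,n}^{\tilde{\pi}_{l'}}$
    $\hat{B} \leftarrow \frac{1}{lMN} \sum_{l'=1}^{l} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}_{l'}}, a_{m,n}^{\tilde{\pi}_{l'}}; \{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l})\, r_{m,n}^{\tilde{\pi}_{l'}}\, w_{m,n}^{\tilde{\pi}_{l'}}$
    $\hat{\theta}_l \leftarrow (\hat{A} + \lambda I)^{-1} \hat{B}$
    // Update $\pi_l$ using $\hat{Q}^{\pi_l}$
    $\pi_{l+1} \leftarrow$ PolicyImprovement($\hat{\theta}_l$, $\phi$)
return ($\pi_{L+1}$)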

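Figure 6: The pseudo code of ActivePolicyIteration. By the DataSampling function,
episodic samples ($M$ episodes and $N$ steps) are collected using the input policy. By the
PolicyImprovement function, the current policy is updated with policy improvement such
as $\epsilon$-greedy update or softmax update. The pseudo code of SamplingPolicySelection is
shown in Algorithm 2 in Figure 7.

Algorithm 2: SamplingPolicySelection($\{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l-1}$, $\phi$, $\pi_l$, $\lambda$, $Z$)

// $\{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l-1}$  The previously-collected training samples up to $(l-1)$-th iteration
// $\phi$     Basis functions, $\phi(s,a) = (\phi_1(s,a), \phi_2(s,a), \ldots, \phi_B(s,a))^\top$
// $\pi_l$    The evaluation policy in the $l$-th iteration, $\pi_l(a|s)$ ($\in [0,1]$)
// $\lambda$  Regularization parameter, $\lambda$ ($> 0$)
// $Z$        The number of J-value calculations to compute the average $J'$, $Z$ ($\in \mathbb{N}$)

for $z \leftarrow 1, 2, \ldots, Z$ do
    for $k \leftarrow 1, 2, \ldots, K$ do
        // Generate sampling policy candidate $\tilde{\pi}_k^{(z)}$ and collect episodic samples
        // without immediate rewards using $\tilde{\pi}_k^{(z)}$
        $\mathcal{D}^{\tilde{\pi}_k^{(z)}} \leftarrow$ RewardlessDataSampling($\tilde{\pi}_k^{(z)}$)
    // Estimate matrix $\hat{U}^{(z)}$
    // $\mathcal{D}_0 = \{\mathcal{D}^{\tilde{\pi}_k^{(z)}}\}_{k=1}^{K} \cup \{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l-1}$, $\Pi_0 = \{\tilde{\pi}_k^{(z)}\}_{k=1}^{K} \cup \{\tilde{\pi}_{l'}\}_{l'=1}^{l-1}$
    $\hat{U}^{(z)} \leftarrow \frac{1}{(K+l-1)MN} \sum_{\tilde{\pi} \in \Pi_0} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}_0) \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}_0)^\top w_{m,N}^{\tilde{\pi}}$
    for $k \leftarrow 1, 2, \ldots, K$ do
        // Calculate $J_k^{(z)}$
        // $\Pi_k = \tilde{\pi}_k^{(z)} \cup \{\tilde{\pi}_{l'}\}_{l'=1}^{l-1}$, $\hat{e}_{m,n}^{\tilde{\pi}} = e_{N(m-1)+n}$ ($\in \mathbb{R}^{MN}$),
        // $e_i$ ($\in \mathbb{R}^{MN}$) is the standard basis vector: $e_i^{(j)} = I(i = j)$
        $\hat{A} \leftarrow \frac{1}{lMN} \sum_{\tilde{\pi} \in \Pi_k} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}_0) \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}_0)^\top w_{m,N}^{\tilde{\pi}}$
        $\hat{B} \leftarrow \frac{1}{lMN} \sum_{\tilde{\pi} \in \Pi_k} \sum_{m=1}^{M} \sum_{n=1}^{N} \hat{\psi}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \mathcal{D}_0) (\hat{e}_{m,n}^{\tilde{\pi}})^\top w_{m,N}^{\tilde{\pi}}$
        $\hat{L}_k \leftarrow (\hat{A} + \lambda I)^{-1} \hat{B}$
        $J_k^{(z)} \leftarrow \mathrm{tr}(\hat{U}^{(z)} \hat{L}_k \hat{L}_k^\top)$
// Choose the policy $\tilde{\pi}_{AL}$ which minimizes $J'_k = \frac{1}{Z} \sum_{z=1}^{Z} J_k^{(z)}$ ($k = 1, 2, \ldots, K$)
$\tilde{\pi}_{AL} \leftarrow \mathrm{argmin}_{k} J'_k$
return ($\tilde{\pi}_{AL}$)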


Figure 7: The pseudo code of SamplingPolicySelection. In the function
RewardlessDataSampling, episodic samples without immediate rewards ($M$ episodes and $N$ steps)
are collected. Previously-collected training samples $\{\mathcal{D}^{\tilde{\pi}_{l'}}\}_{l'=1}^{l-1}$
are used for the calculation of matrices $\hat{U}$, $\hat{A}$, and $\hat{B}$ in J-value calculation.

Figure 8: The mean performance over 1000 trials in the 10-state chain-walk experiment.
The dotted lines denote the performance of passive learning (PL) and the solid lines
denote the performance of the proposed active learning (AL) method. The error bars
are omitted for clear visibility. For both the ‘decreasing M’ and ‘fixed M’ strategies, the
performance of AL after the 7-th iteration is significantly better than that of PL according
to the two-tailed paired Student t-test at the significance level 1% applied to the error
values at the 7-th iteration.

experiment  1000  times  with  different  random  seeds  and  evaluate  the  average  performance 
of  each  learning  method.  The  results  are  depicted  in  Figure  8,  showing  that  the  proposed 
AL  method  works  better  than  PL  in  both  types  of  episode  scheduling  with  statistical 
significance  by  the  two-tailed  paired  Student  t-test  at  the  significance  level  1%  (Henkel, 
1979)  for  the  error  values  obtained  at  the  7-th  iteration.  Furthermore,  the  ‘decreasing  M ’ 
strategy  outperforms  the  ‘fixed  M'  strategy  for  both  PL  and  AL,  showing  the  usefulness 
of  the  ‘decreasing  M'  strategy. 
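The selection step of Algorithm 2 can be illustrated with a small NumPy sketch, assuming the matrices called U, A, and B in Figure 7 have already been estimated from samples. The function names `j_value` and `select_sampling_policy` are hypothetical, and the sketch only covers the final linear-algebra step, not the estimation of the matrices themselves.

```python
import numpy as np

def j_value(U, A, B, lam=1e-3):
    """J = tr(U L L^T) with L = (A + lam * I)^{-1} B (regularized solve)."""
    L = np.linalg.solve(A + lam * np.eye(A.shape[0]), B)
    return float(np.trace(U @ L @ L.T))

def select_sampling_policy(j_values):
    """Pick the candidate with the smallest J averaged over the Z repetitions.

    j_values[z, k] is the J-value of candidate k in repetition z
    (hypothetical layout chosen for this sketch).
    """
    return int(np.argmin(np.mean(np.asarray(j_values, dtype=float), axis=0)))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a design choice of this sketch; it computes the same quantity while being numerically more stable.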

5  Experiments 

Finally, we evaluate the performance of the proposed API method using a ball-batting
robot illustrated in Figure 9, which consists of two links and two joints. The goal of
the ball-batting task is to control the robot arm so that it drives the ball as far away
as possible. The state space $\mathcal{S}$ is continuous and consists of angles $\varphi_1$ [rad] ($\in [0, \pi/4]$)
and $\varphi_2$ [rad] ($\in [-\pi/4, \pi/4]$) and angular velocities $\dot{\varphi}_1$ [rad/s] and $\dot{\varphi}_2$ [rad/s]. Thus a state
$s$ ($\in \mathcal{S}$) is described by a four-dimensional vector:

$s = (\varphi_1, \dot{\varphi}_1, \varphi_2, \dot{\varphi}_2)^\top$.  (62)

The action space $\mathcal{A}$ is discrete and contains two elements:

$\mathcal{A} = \{a^{(i)}\}_{i=1}^{2} = \{(50, -35)^\top, (-50, 10)^\top\}$,  (63)

where the $i$-th element ($i = 1, 2$) of each vector corresponds to the torque [N·m] added
to joint $i$.

We use the Open Dynamics Engine (‘http://ode.org/’) for physical calculations including
the update of the angles and angular velocities, and collision detection between
the robot arm, ball, and pin. The simulation time-step is set to 7.5 [ms] and the next
state is observed after 10 time-steps. The action chosen in the current state is held
for 10 time-steps. To make the experiments realistic, we add noise to actions: if action
$(f_1, f_2)^\top$ is taken, the actual torques applied to the joints are $f_1 + \varepsilon_1$ and $f_2 + \varepsilon_2$, where
$\varepsilon_1$ and $\varepsilon_2$ are drawn independently from the Gaussian distribution with mean 0 and
variance 3.
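The action-noise model can be sketched as follows. This is only an illustration: the function name `apply_action_noise` is hypothetical, and the one point worth noting is that a variance of 3 corresponds to a standard deviation of $\sqrt{3}$ when calling a Gaussian sampler.

```python
import numpy as np

def apply_action_noise(torques, rng, variance=3.0):
    """Add independent zero-mean Gaussian noise to each commanded torque."""
    torques = np.asarray(torques, dtype=float)
    # NumPy's normal() takes the standard deviation, not the variance.
    noise = rng.normal(0.0, np.sqrt(variance), size=torques.shape)
    return torques + noise

rng = np.random.default_rng(0)
noisy = apply_action_noise([50.0, -35.0], rng)  # one noisy version of the first action
```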

The immediate reward is defined as the carry of the ball. This reward is given only
when the robot arm collides with the ball for the first time at state $s'$ after taking action
$a$ at current state $s$. For value function approximation, we use the 110 basis functions
defined as

$\phi_{2(i-1)+j}(s, a) = \begin{cases} I(a = a^{(j)}) \exp\left(-\frac{\|s - c_i\|^2}{2\tau^2}\right) & \text{for } i = 1, 2, \ldots, 54 \text{ and } j = 1, 2, \\ I(a = a^{(j)}) & \text{for } i = 55 \text{ and } j = 1, 2, \end{cases}$  (64)

where $\tau$ is set to $3\pi/2$ and the Gaussian centers $c_i$ ($i = 1, 2, \ldots, 54$) are located on the
regular grid

$\{0, \pi/4\} \times \{-\pi, 0, \pi\} \times \{-\pi/4, 0, \pi/4\} \times \{-\pi, 0, \pi\}$.  (65)
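A minimal sketch of how the 110-dimensional feature vector could be built is given below. The grid, the width $3\pi/2$, and the count of 54 centers plus one constant term per action follow the text; the function name `basis`, the zero-based `action_index`, and the ordering of features in the output vector are assumptions of this sketch.

```python
import itertools
import numpy as np

TAU = 3 * np.pi / 2
GRID = [
    (0.0, np.pi / 4),
    (-np.pi, 0.0, np.pi),
    (-np.pi / 4, 0.0, np.pi / 4),
    (-np.pi, 0.0, np.pi),
]
# Cartesian product of the four per-dimension grids: 2 * 3 * 3 * 3 = 54 centers.
CENTERS = np.array(list(itertools.product(*GRID)))

def basis(s, action_index):
    """Return the 110 basis values for state s and action index 0 or 1."""
    s = np.asarray(s, dtype=float)
    sq_dist = ((CENTERS - s) ** 2).sum(axis=1)
    # 54 Gaussian bumps followed by one constant term.
    per_action = np.append(np.exp(-sq_dist / (2 * TAU**2)), 1.0)
    phi = np.zeros((55, 2))
    phi[:, action_index] = per_action  # the indicator I(a = a^(j)) zeroes the other column
    return phi.reshape(-1)             # row-major: position 2*i + j (zero-based)
```

Only the column of the chosen action is non-zero, so a linear value function on these features effectively keeps a separate weight vector per action.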

We set $L = 7$ and $N = 10$. We again compare the ‘decreasing M’ strategy and the
‘fixed M’ strategy. The ‘decreasing M’ strategy is defined by $\{10, 10, 7, 7, 7, 4, 4\}$ and
the ‘fixed M’ strategy is defined by $\{7, 7, 7, 7, 7, 7, 7\}$. The initial state is always set to
$s = (\pi/4, 0, 0, 0)^\top$. The regularization parameter is set to $\lambda = 10^{-3}$ and the number of
J-value calculations in the AL method is set to 5. The initial evaluation policy $\pi_1$ is set to the
$\epsilon$-greedy policy defined as

$\pi_1(a|s) = 0.15 p_u(a) + 0.85 I(a = \mathrm{argmax}_{a'} Q_0(s, a'))$,  (66)

$Q_0(s, a) = \sum_{b=1}^{110} \phi_b(s, a)$.  (67)

Policies are updated in the $l$-th iteration using the $\epsilon$-greedy rule with $\epsilon = 0.15/l$. The
way we prepare sampling-policy candidates is the same as in the chain-walk experiment in
Section 4.2.
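The $\epsilon$-greedy rule used here can be sketched as below, reading $p_u$ as the uniform distribution over actions (an interpretation assumed for this sketch). The function names are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: epsilon * uniform + (1 - epsilon) on the greedy action."""
    q_values = np.asarray(q_values, dtype=float)
    probs = np.full(q_values.size, epsilon / q_values.size)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def epsilon_at_iteration(l, eps0=0.15):
    """Exploration rate 0.15 / l used when updating the policy in iteration l."""
    return eps0 / l
```

For two actions and $\epsilon = 0.15$, the greedy action receives probability 0.925 and the other action 0.075.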

The discount factor $\gamma$ is set to 1 and the performance of learned policy $\pi_{L+1}$ is measured
by the discounted sum of immediate rewards for test samples $\{r_{m,n}^{\pi_{L+1}}\}_{m=1,n=1}^{M,N}$ (20 episodes
with 10 steps collected following $\pi_{L+1}$):

$\mathrm{Performance} = \sum_{m=1}^{M} \sum_{n=1}^{N} r_{m,n}^{\pi_{L+1}}$.  (68)

The experiment is repeated 500 times with different random seeds and the average
performance of each learning method is evaluated. The results are depicted in Figure 10,
showing that the proposed API method outperforms the PL strategy; for the ‘decreasing
M’ strategy, the performance difference is statistically significant by the two-tailed paired
Student t-test at the significance level 1% for the error values at the 7-th iteration.

Based on the experimental evaluation, we conclude that the proposed sampling-policy
design method, API, is useful for improving the RL performance. Moreover, the ‘decreasing
M’ strategy is shown to be a useful heuristic to further enhance the performance of
API.

6  Conclusions  and  Future  Work 

When  we  cannot  afford  to  collect  many  training  samples  due  to  high  sampling  costs, 
it  is  crucial  to  choose  the  most  ‘informative’  samples  for  efficiently  learning  the  value 
function.  In  this  paper,  we  proposed  a  new  data  sampling  strategy  for  reinforcement 
learning  based  on  a  statistical  active  learning  method  proposed  by  Sugiyama  (2006).  The 
proposed procedure called active policy iteration (API), which effectively combines the
framework of sample-reuse policy iteration (Hachiya et al., 2009) with active
sampling-policy selection, was shown to perform well in simulations with chain-walk and
batting robot control.

Our active learning strategy is a batch method and does not require previously collected
reward samples. However, in the proposed API framework, reward samples are
available from the previous iterations. A naive extension would be to include those previous
samples in the generalization error estimator, for example, following the two-stage
active learning scheme proposed by Kanamori & Shimodaira (2003), in which both the
bias and variance terms are estimated using the labeled samples. However, such a
bias-and-variance approach was shown to perform poorly compared with the variance-only
approach (which we used in the current paper) (Sugiyama, 2006). Thus, developing an
active learning strategy which can effectively make use of previously collected samples is
an important direction for future work.

For the case where the total number of episodic samples to be gathered is fixed, we gathered
many samples in earlier iterations, rather than gathering samples evenly in each iteration.
Although this strategy was shown to perform well in the experiments, we do not yet have a
strong justification for this heuristic. Thus theoretical analysis would be necessary for
understanding the mechanism of this approach and further improving the performance.

In the proposed method, the basis function $\psi(s, a)$ defined by Eq.(14) was approximated
by $\hat{\psi}(s, a; \mathcal{D})$ defined by Eq.(20) using samples. When the state-action space is
continuous, this is theoretically problematic since only a single sample is available for

(Object Settings)

0.65 [m] (length), 11.5 [kg] (mass)
0.35 [m] (length), 6.2 [kg] (mass)
0.1 [m] (radius), 0.1 [kg] (mass)
0.3 [m] (height), 7.5 [kg] (mass)


Figure  9:  A  ball-batting  robot. 


Figure  10:  The  mean  performance  over  500  trials  in  the  ball-batting  experiment.  The 
dotted  lines  denote  the  performance  of  passive  learning  (PL)  and  the  solid  lines  denote 
the  performance  of  the  proposed  active  learning  (AL)  method.  The  error  bars  are  omitted 
for  clear  visibility.  For  the  ‘decreasing  M'  strategy,  the  performance  of  AL  after  the  7-th 
iteration  is  significantly  better  than  that  of  PL  according  to  the  two-tailed  paired  Student 
t-test  at  the  significance  level  1%  for  the  error  values  at  the  7-th  iteration. 


approximation and thus consistency may not be guaranteed. Although we experimentally
confirmed that the single-sample approximation gave reasonably good performance, it is
important to investigate the convergence issue theoretically in future work.

The R-max strategy (Brafman & Tennenholtz, 2002) is an approach to controlling the
trade-off between exploration and exploitation in the model-based RL framework. The
LSPI R-max method (Strehl et al., 2007; Li, Littman, & Mansley, 2009) is an application
of the R-max idea to the LSPI framework. It is therefore interesting to investigate the
relation between the LSPI R-max method and the proposed method. Moreover, exploring
alternative active learning strategies in the model-based RL formulation would be a
promising research direction in the future.

Abstract

In recent years, a great deal of effort has been devoted to age estimation from
face images. It has been reported that age can be accurately estimated under a
controlled environment such as frontal faces, no expression, and static lighting
conditions. However, it is not straightforward to achieve the same accuracy level in a
real-world environment because of considerable variations in camera settings, facial
poses, and illumination conditions. In this paper, we apply a recently-proposed
machine learning technique called covariate shift adaptation to alleviate the effect of
lighting condition change between laboratory and practical environments. Through
real-world age estimation experiments, we demonstrate the usefulness of our proposed
method.


Keywords 

face recognition, age estimation, covariate shift adaptation, lighting condition change,
Kullback-Leibler importance estimation procedure, importance-weighted regularized
least-squares


1  Introduction 

In  recent  years,  demographic  analysis  in  public  places  such  as  shopping  malls  and  stations 
is  attracting  a  great  deal  of  attention.  Such  demographic  information  is  useful  for  various 
purposes  including  designing  effective  marketing  strategies  and  targeted  advertisement 
based  on  customers’  gender  and  age.  For  this  reason,  a  number  of  approaches  have  been 
explored for age estimation from face images [2, 3], and several databases became publicly
available recently [1, 6].

The recognition performance of age prediction systems is significantly influenced, e.g.,
by the type of camera, camera calibration, and lighting variations, and the publicly available
databases were mainly collected in semi-controlled environments. For this reason,
existing age prediction systems built upon such databases tend to perform poorly in
real-world environments.

The situation where training and test data are drawn from different distributions is
called covariate shift [8, 11, 12]. In this paper, we formulate the problem of age estimation
in real-world environments as a supervised learning problem under covariate shift. Within
the covariate shift framework, a method called importance-weighted least-squares allows
us to alleviate the influence of environmental changes, by assigning higher weights to data
samples having high test input densities and low training input densities. We demonstrate
through real-world experiments that age estimation based on covariate shift adaptation
achieves higher accuracy than baseline approaches.

2  Proposed  Method 

In  this  section,  we  formulate  the  problem  of  age  estimation  as  a  supervised  learning 
problem  under  covariate  shift,  and  then  describe  our  proposed  method. 

2.1  Formulation 

Throughout  this  paper,  we  perform  age  estimation  based  not  on  subjects’  real  age,  but 
on  their  perceived  age.  Thus,  the  ‘true’  age  of  the  subject  y  is  defined  as  the  average 
perceived  age  evaluated  by  those  who  observed  the  subject’s  face  images  (the  value  is 
rounded-off  to  the  nearest  integer). 

Let us consider a regression problem of estimating the age $y^*$ of subject $x$ (face
features). We use the following model for regression.

$f(x; \alpha) = \sum_{i=1}^{n_{tr}} \alpha_i K(x, x_i^{tr})$,  (1)

where $\alpha = (\alpha_1, \ldots, \alpha_{n_{tr}})^\top$ is a model parameter, $\top$ denotes the transpose, and $K(x, x')$
is a positive definite kernel [7].

Suppose we are given labeled training data $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$. A standard approach to
learning the model parameter $\alpha$ would be regularized least-squares [4],

$\min_{\alpha} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( y_i^{tr} - f(x_i^{tr}; \alpha) \right)^2 + \lambda \|\alpha\|^2 \right]$,  (2)

where $\|\cdot\|$ denotes the Euclidean norm, and $\lambda$ ($> 0$) is the regularization parameter to
avoid overfitting.

Below,  we  explain  that  merely  using  regularized  least-squares  is  not  appropriate  in 
real-world perceived age prediction, and show how to cope with this problem.

Figure 1: The relation between subjects’ perceived age $y^*$ (horizontal axis, age [year]) and its
standard deviation (vertical axis)


2.2  Incorporating  Age  Perception  Characteristics 


Human age perception is known to have heterogeneous characteristics, e.g., it is rare to
misjudge the age of a 5-year-old child as 15 years old, but the age of a 35-year-old person
is often misjudged as 45 years old. In order to quantify this phenomenon, a large-scale
questionnaire survey was carried out in [15]: each of 72 volunteers was asked to give
age labels $y$ to approximately 1000 face images. Figure 1 depicts the relation between
subjects’ perceived age $y^*$ and its standard deviation. This shows that the perceived age
deviation tends to be small in younger age brackets and large in older age brackets.

In order to match characteristics of our age prediction system to those of human age
perception, we weight the goodness-of-fit term in Eq.(2) according to the inverse variance
of the perceived age:

$\min_{\alpha} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{\left( y_i^{tr} - f(x_i^{tr}; \alpha) \right)^2}{w_{age}(y_i^{tr})^2} + \lambda \|\alpha\|^2 \right]$,  (3)

where $w_{age}(y)$ is the standard deviation of the perceived age at age $y$ (see Figure 1 again).


2.3  Coping  with  Lighting  Condition  Change 

When designing age estimation systems, the environment of recording training face images
is often different from the test environment in terms of lighting conditions. Typically,
training data are recorded indoors, for example in a studio with appropriate illumination. On
the other hand, in real-world environments, lighting conditions vary considerably,
e.g., strong sunlight might be cast from one side of the face or there may not be enough light. In such
situations, age estimation accuracy is significantly degraded.

Let $p_{tr}(x)$ be the training input density and $p_{te}(x)$ be the test input density. When
these two densities are different, it would be natural to emphasize the influence of training
samples $(x_i^{tr}, y_i^{tr})$ which have high similarity to data in the test environment. Such
adjustment can be systematically carried out as follows [8, 11, 12]:

$\min_{\alpha} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} w_{imp}(x_i^{tr}) \frac{\left( y_i^{tr} - f(x_i^{tr}; \alpha) \right)^2}{w_{age}(y_i^{tr})^2} + \lambda \|\alpha\|^2 \right]$,  (4)

i.e., the goodness-of-fit term in Eq.(3) is weighted according to the importance function:

$w_{imp}(x) = \frac{p_{te}(x)}{p_{tr}(x)}$.


The solution of Eq.(4) can be obtained analytically by

$\hat{\alpha} = (K^{tr} W^{tr} K^{tr} + n_{tr} \lambda I_{n_{tr}})^{-1} K^{tr} W^{tr} y^{tr}$,  (5)

where $K^{tr}$ is the kernel matrix whose $(i, i')$-th element is defined by

$K^{tr}_{i,i'} = K(x_i^{tr}, x_{i'}^{tr})$,

$W^{tr}$ is the $n_{tr}$-dimensional diagonal matrix with $(i, i)$-th diagonal element defined by

$W^{tr}_{i,i} = \frac{w_{imp}(x_i^{tr})}{w_{age}(y_i^{tr})^2}$,

$I_{n_{tr}}$ is the $n_{tr}$-dimensional identity matrix, and $y^{tr}$ is the $n_{tr}$-dimensional vector with $i$-th
element defined by $y_i^{tr}$.

When the number of training data $n_{tr}$ is large, we may reduce the number of kernels
in Eq.(1) so that the inverse matrix in Eq.(5) can be computed with limited memory; or
we may compute the solution numerically by a stochastic gradient-descent method.
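The analytic solution of Eq.(5) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names `gaussian_kernel` and `fit_iw_rls` are hypothetical, a Gaussian kernel is assumed for $K$, and the importance and age weights are taken as given arrays.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix between the rows of X1 and the rows of X2."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2 * sigma**2))

def fit_iw_rls(K_tr, y_tr, w_imp, w_age, lam):
    """Solve (K W K + n * lam * I) alpha = K W y for alpha, as in Eq.(5)."""
    n = K_tr.shape[0]
    # Diagonal weights: importance divided by the squared age deviation.
    W = np.diag(np.asarray(w_imp, dtype=float) / np.asarray(w_age, dtype=float) ** 2)
    lhs = K_tr @ W @ K_tr + n * lam * np.eye(n)
    return np.linalg.solve(lhs, K_tr @ W @ y_tr)
```

A prediction for new inputs `X_new` would then be `gaussian_kernel(X_new, X_tr, sigma) @ alpha`, following the model in Eq.(1).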


2.4  Importance- Weighted  Cross-Validation  (IWCV) 

In  supervised  learning,  the  choice  of  models  (for  example,  the  basis  functions  and  the 
regularization  parameter)  is  crucial  for  obtaining  better  performance.  Cross-validation 
(CV)  would  be  one  of  the  most  popular  techniques  for  model  selection  [9].  CV  has 
been  shown  to  give  an  almost  unbiased  estimate  of  the  generalization  error  with  finite 
samples  [7],  but  such  almost  unbiasedness  is  no  longer  fulfilled  under  covariate  shift. 

To cope with this problem, a variant of CV called importance-weighted CV (IWCV)
has been proposed [11]. Let us randomly divide the training set

$\mathcal{Z} = \{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$

into $T$ disjoint non-empty subsets $\{\mathcal{Z}_t\}_{t=1}^{T}$ of (approximately) the same size. Let $\hat{f}_{\mathcal{Z}_t}(x)$
be a function learned from $\mathcal{Z} \setminus \mathcal{Z}_t$ (i.e., without $\mathcal{Z}_t$). Then the $T$-fold IWCV
estimate of the generalization error is given by

$\frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\mathcal{Z}_t|} \sum_{(x,y) \in \mathcal{Z}_t} w_{imp}(x) \frac{\left( \hat{f}_{\mathcal{Z}_t}(x) - y \right)^2}{w_{age}(y)^2}$,

where $|\mathcal{Z}_t|$ denotes the number of samples in the subset $\mathcal{Z}_t$.

It was proved that IWCV gives an almost unbiased estimate of the generalization error
even under covariate shift [11].

Table 1: Pseudo code of KLIEP. ‘./’ indicates the element-wise division. Inequalities and
the ‘max’ operation for vectors are applied in an element-wise manner.

Input: $\{x_i^{tr}\}_{i=1}^{n_{tr}}$, $\{x_j^{te}\}_{j=1}^{n_{te}}$
Output: $\hat{w}(x)$

Choose $\{c_k\}_{k=1}^{b}$ as a subset of $\{x_j^{te}\}_{j=1}^{n_{te}}$;
$A_{j,k} \leftarrow \exp(-\|x_j^{te} - c_k\|^2 / (2\gamma^2))$;
$b_k \leftarrow \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp(-\|x_i^{tr} - c_k\|^2 / (2\gamma^2))$;
Initialize $\beta$ ($> 0$) and $\epsilon$ ($0 < \epsilon \ll 1$);
Repeat until convergence
    $\beta \leftarrow \beta + \epsilon A^\top (1./A\beta)$;
    $\beta \leftarrow \beta + (1 - b^\top \beta) b / (b^\top b)$;
    $\beta \leftarrow \max(0, \beta)$;
    $\beta \leftarrow \beta / (b^\top \beta)$;
end
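The $T$-fold IWCV estimate of Section 2.4 can be sketched as follows. The function name `iwcv_error` and the `fit` and `predict` callables are hypothetical placeholders for whatever learner is being validated; the sketch only shows how the held-out errors are weighted and averaged.

```python
import numpy as np

def iwcv_error(X, y, w_imp, w_age, fit, predict, n_folds=4, seed=0):
    """Importance-weighted, age-weighted cross-validation error.

    fit(X, y, w_imp, w_age) returns a model; predict(model, X) returns
    predictions. Both are supplied by the caller (assumed interface).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    fold_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        model = fit(X[train], y[train], w_imp[train], w_age[train])
        residual = predict(model, X[held_out]) - y[held_out]
        # Each held-out squared error is scaled by w_imp(x) / w_age(y)^2.
        fold_errors.append(
            np.mean(w_imp[held_out] * residual**2 / w_age[held_out] ** 2)
        )
    return float(np.mean(fold_errors))
```

Model selection would then amount to evaluating `iwcv_error` for each candidate hyper-parameter setting and keeping the setting with the smallest value.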

2.5  Kullback-Leibler  Importance  Estimation  Procedure 
(KLIEP) 

In order to compute the solution (5) or perform IWCV, we need the importance weights
$w_{imp}(x_i^{tr}) = p_{te}(x_i^{tr}) / p_{tr}(x_i^{tr})$, which include two probability densities $p_{tr}(x)$ and $p_{te}(x)$.
However, since density estimation is a hard problem, a two-stage approach of first
estimating $p_{tr}(x)$ and $p_{te}(x)$ and then taking their ratio may not be reliable. Here we describe
a method called Kullback-Leibler Importance Estimation Procedure (KLIEP) [12], which
allows us to directly estimate the importance function $w_{imp}(x)$ without going through
density estimation of $p_{tr}(x)$ and $p_{te}(x)$.

Figure 2: Examples of face images under different lighting conditions (left: standard
lighting, middle: dark, right: strong light from a side)

Let us model $w_{imp}(x)$ using the following model:

$\hat{w}_{imp}(x) = \sum_{k=1}^{b} \beta_k \exp\left( -\frac{\|x - c_k\|^2}{2\gamma^2} \right)$,  (6)

where $\beta = (\beta_1, \ldots, \beta_b)^\top$ is a parameter, and $\{c_k\}_{k=1}^{b}$ is a subset of test input samples
$\{x_j^{te}\}_{j=1}^{n_{te}}$. Using the model $\hat{w}_{imp}(x)$, we can estimate the test input density $p_{te}(x)$ by

$\hat{p}_{te}(x) = \hat{w}_{imp}(x) p_{tr}(x)$.  (7)

We determine the parameter $\beta$ in the model (7) so that the Kullback-Leibler divergence
from $p_{te}$ to $\hat{p}_{te}$ is minimized:

$\mathrm{KL}[p_{te} \| \hat{p}_{te}] = \int p_{te}(x) \log \frac{p_{te}(x)}{p_{tr}(x)} dx - \int p_{te}(x) \log \hat{w}_{imp}(x) dx$.

We ignore the first term (which is a constant) and require $\hat{w}_{imp}(x)$ to be non-negative
and normalized. Then we obtain the following convex optimization problem:

$\max_{\beta} \sum_{j=1}^{n_{te}} \log \hat{w}_{imp}(x_j^{te})$  subject to  $\frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \hat{w}_{imp}(x_i^{tr}) = 1$  and  $\beta \geq 0$.

A pseudo code of KLIEP is described in Table 1. The tuning parameter $\gamma$ can be
optimized based on likelihood cross-validation (LCV) [12].
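The update loop of Table 1 can be sketched in NumPy as below. This is an illustration rather than a reference implementation: the function name `kliep`, the fixed iteration count used in place of a convergence check, the default step size, and the random choice of centers are all assumptions of this sketch.

```python
import numpy as np

def kliep(X_tr, X_te, gamma, n_centers=100, step=1e-3, n_iter=1000, seed=0):
    """Return a function x -> estimated importance weight p_te(x) / p_tr(x)."""
    X_tr = np.asarray(X_tr, dtype=float)
    X_te = np.asarray(X_te, dtype=float)
    rng = np.random.default_rng(seed)
    n_basis = min(n_centers, len(X_te))
    # Gaussian centers are a random subset of the test inputs.
    centers = X_te[rng.choice(len(X_te), size=n_basis, replace=False)]

    def design(X):
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2 * gamma**2))

    A = design(X_te)                   # kernel values at test inputs
    b_vec = design(X_tr).mean(axis=0)  # mean kernel values over training inputs
    beta = np.ones(n_basis)
    for _ in range(n_iter):
        beta = beta + step * A.T @ (1.0 / (A @ beta))                    # gradient ascent
        beta = beta + (1.0 - b_vec @ beta) * b_vec / (b_vec @ b_vec)     # normalization
        beta = np.maximum(0.0, beta)                                     # non-negativity
        beta = beta / (b_vec @ beta)                                     # renormalize
    return lambda X: design(np.asarray(X, dtype=float)) @ beta
```

Because the last step of every iteration rescales $\beta$ so that $b^\top \beta = 1$, the estimated weights average to one over the training inputs, which is the normalization constraint stated above.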

3  Empirical  Evaluation 

In  this  section,  we  experimentally  evaluate  the  performance  of  the  proposed  method  using 
in-house  face-age  datasets. 

We  use  the  face  images  recorded  under  17  different  lighting  conditions:  for  instance, 
average  illuminance  from  above  is  approximately  1000  lux  and  500  lux  from  the  front 
in  the  standard  lighting  condition,  250  lux  from  above  and  125  lux  from  the  front  in 
the  dark  setting,  and  190  lux  from  left  and  750  lux  from  right  in  another  setting  (see 
Figure  2).  Note  that  these  17  lighting  conditions  are  diverse  enough  to  cover  real-world 
lighting  conditions.  Images  were  recorded  as  movies  with  camera  at  depression  angle  15 
degrees. The number of subjects is approximately 500 (250 for each gender). We used a
face detector for localizing the two eye-centers, and then rescaled the image to 64 × 64
pixels. The number of face images in each environment is about 2500 (5 face images ×
500 subjects).

As pre-processing, a neural network feature extractor [14] was used to extract
100-dimensional features from 64 × 64 face images. The kernel regression model (1) with the
following Gaussian kernel was employed for the extracted 100-dimensional data:

$K_{\sigma}(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$.

We  constructed  the  male/female  age  prediction  models  only  using  male/female  data, 
assuming  that  gender  classification  had  been  correctly  carried  out. 

We split the 250 subjects into the training set (200 subjects) and the test set (50
subjects). The training set was used for training the kernel regression model (1), and
the test set was used for evaluating its generalization performance. For the test samples
$\{(x_j^{te}, y_j^{te})\}_{j=1}^{n_{te}}$ taken from the test set in the environment with strong light from a side,
age-weighted mean square error (WMSE)

$\mathrm{WMSE} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \frac{\left( y_j^{te} - f(x_j^{te}; \hat{\alpha}) \right)^2}{w_{age}(y_j^{te})^2}$

was calculated as a performance measure. The training and test sets were shuffled 5 times in
such a way that each subject was selected as a test sample once. The final performance
was evaluated based on the average WMSE over the 5 trials.
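For concreteness, the WMSE measure can be computed as in the short sketch below. The function name `wmse` is hypothetical; `w_age` is assumed to hold the perceived-age standard deviation evaluated at each true label.

```python
import numpy as np

def wmse(y_true, y_pred, w_age):
    """Mean of squared errors, each divided by the squared age deviation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w_age = np.asarray(w_age, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2 / w_age**2))
```

An error of two years on a subject whose perceived-age deviation is two years therefore counts the same as an error of three years on a subject whose deviation is three years.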

We  compared  the  performance  of  the  proposed  method  with  the  two  baseline  methods: 

Baseline method 1: Training samples were taken only from the standard lighting condition
and age-weighted regularized least-squares (3) was used for training.

Baseline method 2: Training samples were taken from all 17 different lighting conditions
and age-weighted regularized least-squares (3) was used for training.

The importance weights were not used in these baseline methods. The Gaussian width
$\sigma$ and the regularization parameter $\lambda$ were determined based on 4-fold CV over WMSE,
i.e., the training set was further divided into a training part (150 subjects) and a validation
part (50 subjects).

In the proposed method, training samples were taken from all 17 different lighting
conditions (which is the same as the baseline method 2). The importance weights were
estimated by KLIEP using the training samples and additional unlabeled test samples;
the hyper-parameter $\gamma$ in KLIEP was determined based on 2-fold LCV [12]. We then
computed the average importance score over different samples for each lighting condition
and used the average importance score for training the regression model. The Gaussian
width $\sigma$ and the regularization parameter $\lambda$ in the regression model were determined
based on 4-fold IWCV [11].

Table 2: The test performance measured by WMSE.

                     Male    Female
Baseline method 1    2.83    6.51
Baseline method 2    2.64    4.40
Proposed method      2.54    3.90

Table 2 summarizes the experimental results, showing that, for both male and female
data, the baseline method 2 is better than the baseline method 1 and the proposed method
is better than the baseline method 2. This illustrates the effectiveness of the proposed
method. Note that WMSE for female subjects is substantially larger than that for male
subjects. A likely reason is that female subjects tend to show more variation in appearance,
such as short/long hair and with/without makeup, which makes prediction harder [16].

4  Summary  and  Future  Works 

Lighting  condition  change  is  one  of  the  critical  causes  of  performance  degradation  in  age 
prediction  from  face  images.  In  this  paper,  we  proposed  to  employ  a  machine  learning 
technique  called  covariate  shift  adaptation  for  alleviating  the  influence  of  lighting  condition 
change.  We  demonstrated  the  effectiveness  of  our  proposed  method  through  real-world 
perceived  age  prediction  experiments. 

In  the  experiments  in  Section  3,  test  samples  were  collected  from  a  particular  lighting 
condition,  and  samples  from  the  same  lighting  condition  were  also  included  in  the  training 
set.  Although  we  believe  this  setup  to  be  practical,  it  would  be  interesting  to  evaluate 
the  performance  of  the  proposed  method  when  no  overlap  in  the  lighting  conditions  exists 
between  training  and  test  data. 

In  principle,  the  covariate  shift  framework  allows  us  to  incorporate  not  only  lighting 
condition  change,  but  also  various  types  of  environment  change  such  as  face  pose  variation 
and  camera  setting  change.  In  our  future  work,  we  will  investigate  whether  the  proposed 
approach  is  still  useful  in  such  challenging  scenarios. 

Recently, novel approaches to density ratio estimation for high-dimensional problems
have been explored [5, 10, 17, 13]. In our future work, we would like to incorporate
these new ideas into our framework of perceived age estimation, and see how the prediction
performance can be further improved.